Over the past decade, deep learning has become the “crown jewel” of artificial intelligence and machine learning [1], showing superior performance in acoustics [2], images [3], and natural language processing [4]. The expressive power of deep learning to extract complex patterns from underlying data is well recognized. On the other hand, graphs are ubiquitous in the real world, representing objects and their relationships in varied domains, including social networks, e-commerce networks, biology networks, and traffic networks. Graphs are also known to have complicated structures that can contain rich underlying values [5]. As a result, how to utilize deep learning methods to analyze graph data has attracted considerable research attention over the past few years. This problem is non-trivial because several challenges exist in applying traditional deep learning architectures to graphs:

Irregular structures of graphs. Unlike images, audio, and text, which have a clear grid structure, graphs have irregular structures, making it hard to generalize some of the basic mathematical operations to graphs [6]. For example, defining convolution and pooling operations, which are the fundamental operations in convolutional neural networks (CNNs), is not straightforward for graph data. This problem is often referred to as the geometric deep learning problem [7].

Heterogeneity and diversity of graphs. A graph itself can be complicated, containing diverse types and properties. For example, graphs can be heterogeneous or homogeneous, weighted or unweighted, and signed or unsigned. In addition, the tasks on graphs also vary widely, ranging from node-focused problems such as node classification and link prediction to graph-focused problems such as graph classification and graph generation. These diverse types, properties, and tasks require different model architectures to tackle specific problems.

Large-scale graphs. In the big-data era, real graphs can easily have millions or billions of nodes and edges; some well-known examples are social networks and e-commerce networks [8]. Therefore, how to design scalable models, preferably models whose time complexity is linear in the graph size, is a key problem.

Incorporating interdisciplinary knowledge. Graphs are often connected to other disciplines, such as biology, chemistry, and social sciences. This interdisciplinary nature provides both opportunities and challenges: domain knowledge can be leveraged to solve specific problems, but integrating domain knowledge can complicate model designs. For example, when generating molecular graphs, the objective function and chemical constraints are often non-differentiable; therefore, gradient-based training methods cannot easily be applied.

In this paper, we try to fill this knowledge gap by comprehensively reviewing deep learning methods on graphs. Specifically, as shown in Figure 1, we divide the existing methods into five categories based on their model architectures and training strategies: graph recurrent neural networks (Graph RNNs), graph convolutional networks (GCNs), graph autoencoders (GAEs), graph reinforcement learning (Graph RL), and graph adversarial methods. We summarize some of the main characteristics of these categories in Table 1 based on the following high-level distinctions. Graph RNNs capture recursive and sequential patterns of graphs by modeling states at either the node level or the graph level. GCNs define convolution and readout operations on irregular graph structures to capture common local and global structural patterns. GAEs assume low-rank graph structures and adopt unsupervised methods for node representation learning. Graph RL defines graph-based actions and rewards to obtain feedback on graph tasks while following constraints. Graph adversarial methods adopt adversarial training techniques to enhance the generalization ability of graph-based models and test their robustness by adversarial attacks.

In the following sections, we provide a comprehensive and detailed overview of these methods, mainly by following their development history and the various ways these methods solve the challenges posed by graphs. We also analyze the differences between these models and delve into how to combine different architectures. Finally, we briefly outline the applications of these models, introduce several open libraries, and discuss potential future research directions. In the appendix, we provide a source code repository, analyze the time complexity of various methods discussed in the paper, and summarize some common applications.

Related works. Several previous surveys are related to our paper. Bronstein et al. [7] summarized some early GCN methods as well as CNNs on manifolds and studied them comprehensively through geometric deep learning. Battaglia et al. [9] summarized how to use GNNs and GCNs for relational reasoning using a unified framework called graph networks; Lee et al. [10] reviewed attention models for graphs; Zhang et al. [11] summarized some GCNs; and Sun et al. [12] briefly surveyed adversarial attacks on graphs. Our work differs from these previous works in that we systematically and comprehensively review different deep learning architectures on graphs rather than focusing on one specific branch. Concurrent with our work, Zhou et al. [13] and Wu et al. [14] surveyed this field from different viewpoints and categorizations. Specifically, neither of their works considers graph reinforcement learning or graph adversarial methods, which are covered in this paper.

Another closely related topic is network embedding, which aims to embed nodes into a low-dimensional vector space [15]–[17]. The main distinction between network embedding and our paper is that we focus on how different deep learning models are applied to graphs, whereas network embedding can be regarded as a concrete application example that uses some of these models (and uses non-deep-learning methods as well).

## RNN

Recurrent neural networks (RNNs) such as gated recurrent units (GRU) [30] or long short-term memory (LSTM) [31] are de facto standards in modeling sequential data. In this section, we review Graph RNNs, which can capture recursive and sequential patterns of graphs. Graph RNNs can be broadly divided into two categories: node-level RNNs and graph-level RNNs. The main distinction is whether the patterns lie at the node level and are modeled by node states, or at the graph level and are modeled by a common graph state. The main characteristics of the methods surveyed are summarized in Table 3.

### Node-level RNNs

Node-level RNNs for graphs, which are also referred to as graph neural networks (GNNs; see footnote 3), date back to the “pre-deep-learning” era [23], [32]. The idea behind a GNN is simple: to encode graph structural information, each node $v_i$ is represented by a low-dimensional state vector $\mathbf{s}_i$. Motivated by recursive neural networks [33], a recursive definition of states is adopted [23]:

（1）

（2）
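The bodies of Eqs. (1) and (2) are missing from this copy. Going by the surrounding description (a recursive state definition over the immediate neighborhood, followed by an output computed from the state), they plausibly take the form below. This is a reconstruction rather than a verbatim quote: $\mathcal{F}$ (written F(·) in the prose) is the state function, $\mathcal{O}$ is an assumed name for the output function, $\mathcal{N}(i)$ denotes the neighbors of $v_i$, and $\mathbf{F}^V$, $\mathbf{F}^E$ are assumed symbols for node and edge features.

```latex
\begin{align}
\mathbf{s}_i &= \sum_{j \in \mathcal{N}(i)} \mathcal{F}\left(\mathbf{s}_i, \mathbf{s}_j, \mathbf{F}^V_i, \mathbf{F}^V_j, \mathbf{F}^E_{i,j}\right) \tag{1} \\
\hat{y}_i &= \mathcal{O}\left(\mathbf{s}_i, \mathbf{F}^V_i\right) \tag{2}
\end{align}
```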

For graph-focused tasks, the authors of [23] suggested adding a special node with unique attributes to represent the entire graph. To learn the model parameters, the following semi-supervised (see footnote 4) method is adopted: after iteratively solving Eq. (1) to a stable point using the Jacobi method [34], one gradient descent step is performed using the Almeida-Pineda algorithm [35], [36] to minimize a task-specific objective function, for example, the squared loss between the predicted values and the ground truth for regression tasks; then, this process is repeated until convergence.

With the two simple equations in Eqs. (1) and (2), GNNs play two important roles. In retrospect, a GNN unifies some of the early methods used for processing graph data, such as recursive neural networks and Markov chains [23]. Looking toward the future, the general idea underlying GNNs has been profoundly influential: as will be shown later, many state-of-the-art GCNs actually have a formulation similar to Eq. (1) and follow the same framework of exchanging information within the immediate node neighborhoods. In fact, GNNs and GCNs can be unified into some common frameworks, and a GNN is equivalent to a GCN that uses identical layers to reach stable states. More discussion will be provided in Section 4.

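Although they are conceptually important, GNNs have several drawbacks. First, to ensure that Eq. (1) has a unique solution, $F(\cdot)$ must be a “contraction map”: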

（3）

Intuitively, a “contraction map” requires that the distance between any two points can only “contract” after the $F(\cdot)$ operation, which severely limits the modeling ability. Second, because many iterations are needed to reach a stable state between gradient descent steps, GNNs are computationally expensive. Because of these drawbacks, and perhaps also a lack of computational power (e.g., the graphics processing unit, GPU, was not widely used for deep learning in those days) and a lack of research interest, GNNs did not become a focus of general research.
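To make the stable-point iteration and the contraction requirement concrete, here is a minimal Python sketch. It is not the model from [23]: the per-neighbor state function (a scaled `tanh` plus the node feature), the scalar states, the factor `mu`, and the toy graph are all assumptions chosen so that the update is a contraction with factor `mu < 1`.

```python
import math

def jacobi_fixed_point(neighbors, features, mu=0.5, tol=1e-10, max_iter=1000):
    """Update all node states simultaneously until they stop changing."""
    states = {v: 0.0 for v in neighbors}
    for iteration in range(1, max_iter + 1):
        new_states = {}
        for v, nbrs in neighbors.items():
            # Toy state function (an assumption): each neighbor j contributes
            # (mu * tanh(s_j) + x_v) / deg(v). Since |tanh'| <= 1, the whole
            # update shrinks max-norm distances by at least the factor mu.
            new_states[v] = sum(
                (mu * math.tanh(states[j]) + features[v]) / len(nbrs) for j in nbrs
            )
        change = max(abs(new_states[v] - states[v]) for v in states)
        states = new_states
        if change < tol:
            return states, iteration
    raise RuntimeError("no stable point reached; is the map a contraction?")

# Hypothetical 4-node graph with scalar node features.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
features = {0: 0.1, 1: -0.3, 2: 0.7, 3: 0.2}
states, n_iter = jacobi_fixed_point(neighbors, features)
print(n_iter, states)
```

Every gradient step in the training procedure would need a full run of such a loop first, which is the source of the computational cost mentioned above.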

A notable improvement to GNNs is gated graph sequence neural networks (GGS-NNs) [24], which introduced the following modifications. Most importantly, the authors replaced the recursive definition in Eq. (1) with a GRU, thus removing the “contraction map” requirement and supporting modern optimization techniques. Specifically, Eq. (1) is adapted as follows:

（4）

3. Recently, GNNs have also been used to refer to general neural networks for graph data. We follow the traditional naming convention and use GNNs to refer to this specific type of Graph RNN.

4. It is called semi-supervised because all the graph structures and some subset of the node or graph labels are used during training.

where $\mathbf{z}$ is calculated by the update gate, $\tilde{\mathbf{s}}$ is the candidate for updating, and $t$ is the pseudo time. Second, the authors proposed using several such networks operating in sequence to produce sequence outputs and showed that their method could be applied to sequence-based tasks such as program verification [38].
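The body of Eq. (4) is missing from this copy. Based on the symbols just defined (update gate $\mathbf{z}$, candidate $\tilde{\mathbf{s}}$, pseudo time $t$) and the standard GRU interpolation, the update plausibly reads as follows. The per-node index $i$ and the element-wise product $\odot$ are assumed notation.

```latex
\begin{equation}
\mathbf{s}_i^{(t)} = \left(1 - \mathbf{z}_i^{(t)}\right) \odot \mathbf{s}_i^{(t-1)} + \mathbf{z}_i^{(t)} \odot \tilde{\mathbf{s}}_i^{(t)} \tag{4}
\end{equation}
```

Read this way, each state is a gated mix of its previous value and a new candidate, so no contraction condition is needed: the network is simply unrolled for a fixed number of pseudo-time steps and trained with backpropagation.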

SSE [25] took an approach similar to Eq. (4). However, instead of using a GRU in the calculation, SSE adopted stochastic fixed-point gradient descent to accelerate the training process. This scheme basically alternates between calculating steady node states using local neighborhoods and optimizing the model parameters, with both calculations performed in stochastic mini-batches.

### Graph-level RNNs

In this subsection, we review how to apply RNNs to capture graph-level patterns, e.g., temporal patterns of dynamic graphs or sequential patterns at different levels of graph granularity. In graph-level RNNs, instead of applying one RNN to each node to learn the node states, a single RNN is applied to the entire graph to encode the graph states.

You et al. [26] applied Graph RNNs to the graph generation problem. Specifically, they adopted two RNNs: one to generate new nodes and the other to generate edges for the newly added node in an autoregressive manner. They showed that such hierarchical RNN architectures learn more effectively from input graphs than traditional rule-based graph generative models do, while having a reasonable time complexity.

To capture the temporal information of dynamic graphs, the dynamic graph neural network (DGNN) [27] was proposed, which used a time-aware LSTM [39] to learn node representations. When a new edge was established, DGNN used the LSTM to update the representations of the two interacting nodes as well as their immediate neighbors, i.e., it considered the one-step propagation effect. The authors showed that the time-aware LSTM could model the establishing orders and time intervals of edge formations well, which in turn benefited a range of graph applications.

Graph RNNs can also be combined with other architectures, such as GCNs or GAEs. For example, aiming to tackle the graph sparsity problem, RMGCNN [28] applied an LSTM to the results of GCNs to progressively reconstruct a graph, as illustrated in Figure 2. By using an LSTM, the information from different parts of the graph can diffuse across long ranges without requiring as many GCN layers. Dynamic GCN [29] applied an LSTM to gather the results of GCNs from different time slices in dynamic networks to capture both the spatial and temporal graph information.

## GRAPH CONVOLUTIONAL NETWORKS  

Graph convolutional networks (GCNs) are inarguably the hottest topic in graph-based deep learning. Mimicking CNNs, modern GCNs learn the common local and global structural patterns of graphs through designed convolution and readout functions. Because most GCNs can be trained with a task-specific loss via backpropagation (with a few exceptions such as the unsupervised training method in [74]), we focus on the adopted architectures. We first discuss the convolution operations, then move to the readout operations and some other improvements. We summarize the main characteristics of the GCNs surveyed in this paper in Table 4.

### Convolution Operation

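Graph convolutions can be divided into two groups: spectral convolutions, which transform node representations into the spectral domain using the graph Fourier transform or its extensions, and spatial convolutions, which operate on node neighborhoods. Note that these two groups can overlap, for example, when a polynomial spectral kernel is used (see Section 4.1.2 for details).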

#### spectral methods

Convolution is the most fundamental operation in CNNs. However, the standard convolution operation used for images or text cannot be directly applied to graphs because graphs lack a grid structure [6]. Bruna et al. [40] first introduced convolution for graph data from the spectral domain using the graph Laplacian matrix $L$ [75], which plays a role similar to that of the Fourier basis in signal processing [6]. The graph convolution operation, $*_G$, is defined as follows:

（5）

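where $\mathbf{u}_1$ and $\mathbf{u}_2$ are two signals defined on the nodes and $Q$ contains the eigenvectors of $L$. Briefly, multiplying by $Q^T$ transforms the graph signals $\mathbf{u}_1, \mathbf{u}_2$ into the spectral domain (i.e., the graph Fourier transform), while multiplying by $Q$ performs the inverse transform. This definition rests on the convolution theorem, i.e., the Fourier transform of a convolution is the element-wise product of the Fourier transforms of the two signals. A signal $\mathbf{u}$ can then be filtered as follows: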

（6）

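where $\mathbf{u}'$ is the output signal and $\Theta$ is a diagonal matrix of learnable filters. A convolutional layer is then defined by applying different filters to different input-output signal pairs: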

（7）

The idea behind Eq. (7) is similar to that of a conventional convolution: the input signals are passed through a set of learnable filters to aggregate information, followed by a nonlinear transformation. By using the node features $\mathbf{F}^V$ as the input layer and stacking multiple convolutional layers, the overall architecture is similar to that of a CNN. Theoretical analysis has shown that such a definition of the graph convolution operation can mimic certain geometric properties of CNNs, and we refer readers to [7] for a comprehensive survey.
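The bodies of Eqs. (5) to (7) are missing from this copy. From the descriptions around them (a transform by $Q^T$, an element-wise product in the spectral domain, an inverse transform by $Q$, and a diagonal learnable filter $\Theta$), they can plausibly be reconstructed as follows. The nonlinearity $\rho$, the layer index $l$, and the number of signals per layer $f_l$ are assumed notation.

```latex
\begin{align}
\mathbf{u}_1 *_G \mathbf{u}_2 &= Q\left(\left(Q^T \mathbf{u}_1\right) \odot \left(Q^T \mathbf{u}_2\right)\right) \tag{5} \\
\mathbf{u}' &= Q \, \Theta \, Q^T \mathbf{u} \tag{6} \\
\mathbf{u}_j^{l+1} &= \rho\left(\sum_{i=1}^{f_l} Q \, \Theta_{i,j}^{l} \, Q^T \mathbf{u}_i^{l}\right), \quad j = 1, \dots, f_{l+1} \tag{7}
\end{align}
```

Under this reading, each diagonal filter $\Theta_{i,j}^{l}$ has one free entry per eigenvalue, i.e., $N$ entries for a graph with $N$ nodes.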

However, directly using Eq. (7) requires learning $O(N)$ parameters, which may not be feasible in practice. Besides, the filters in the spectral domain may not be localized in the spatial domain, i.e., each node may be affected by all the other nodes rather than only the nodes in a small region. To alleviate these problems, Bruna et al. [40] suggested using the following smoothing filters:

（8）

where $\mathcal{K}$ is a fixed interpolation kernel and $\alpha_{l,i,j}$ are learnable interpolation coefficients. The authors also generalized this idea to the setting where the graph is not given but constructed from raw features using either a supervised or an unsupervised method [41].

However, two fundamental problems remain unsolved. First, because the full set of eigenvectors of the Laplacian matrix is needed during each calculation, the time complexity is at least $O(N^2)$ for each forward and backward pass, not to mention the $O(N^3)$ complexity required to calculate the eigendecomposition, meaning that this approach is not scalable to large-scale graphs. Second, because the filters depend on the eigenbasis $Q$ of the graph, the parameters cannot be shared across multiple graphs with different sizes and structures.

Next, we review two lines of work that try to solve these limitations and then unify them using some common frameworks.

#### efficiency aspect

To solve the efficiency problem, ChebNet [42] proposed using a polynomial filter as follows:

（9）

where $\theta_0, \dots, \theta_K$ are the learnable parameters and $K$ is the polynomial order. Then, instead of performing the eigendecomposition, the authors rewrote Eq. (9) using the Chebyshev expansion [76]:

（10）

where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - \mathbf{I}$ are the rescaled eigenvalues, $\lambda_{\max}$ is the maximum eigenvalue, $\mathbf{I} \in \mathbb{R}^{N \times N}$ is the identity matrix, and $T_k(x)$ is the Chebyshev polynomial of order $k$. The rescaling is necessary because Chebyshev polynomials form an orthogonal basis on the interval $[-1, 1]$, so the eigenvalues must be mapped into that range. Using the fact that a polynomial of the Laplacian matrix acts as a polynomial of its eigenvalues, i.e., $L^k = Q \Lambda^k Q^T$, the filter operation in Eq. (6) can be rewritten as follows:

（11）

with $\bar{\mathbf{u}}_0 = \mathbf{u}$ and $\bar{\mathbf{u}}_1 = \tilde{L}\mathbf{u}$. Now, because only products of the sparse matrix $\tilde{L}$ with some vectors need to be calculated, the time complexity becomes $O(KM)$ when using sparse matrix multiplication, where $M$ is the number of edges and $K$ is the polynomial order, i.e., the time complexity is linear with respect to the number of edges. It is also easy to see that such a polynomial filter is strictly $K$-localized: after one convolution, the representation of node $v_i$ will be affected only by its $K$-step neighborhood $\mathcal{N}_K(i)$. Interestingly, this idea is used independently in network embedding to preserve the high-order proximity [77], of which we omit the details for brevity. Kipf and Welling [43] further simplified the filtering by using only the first-order neighbors:

where $\tilde{A} = A + \mathbf{I}$, i.e., adding a self-connection. The authors showed that Eq. (14) is a special case of Eq. (9) obtained by setting $K = 1$ with a few minor changes. Then, the authors argued that stacking an adequate number of layers, as illustrated in Figure 3, has a modeling capacity similar to ChebNet but leads to better results. An important insight of ChebNet and its extension is that they connect the spectral graph convolution with the spatial architecture. Specifically, they show that when the spectral convolution function is polynomial or first-order, the spectral graph convolution is equivalent to a spatial convolution. In addition, the convolution in Eq. (13) is highly similar to the state definition of a GNN in Eq. (1), except that the convolution definition replaces the recursive definition. From this aspect, a GNN can be regarded as a GCN with a large number of identical layers to reach stable states [7], i.e., a GNN uses a fixed function with fixed parameters to iteratively update the node hidden states until reaching an equilibrium, while a GCN has a preset number of layers and each layer contains different parameters.
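The two filters discussed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference code of [42] or [43]. It assumes the standard Chebyshev recurrence $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ for the polynomial filter, assumes the symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ with a ReLU nonlinearity for the first-order layer, and uses dense lists where a real implementation would use sparse matrices. All function names are hypothetical.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product (a sparse product would cost O(M) per call)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cheb_filter(L_tilde, u, theta):
    """Compute sum_k theta[k] * T_k(L_tilde) u without any eigendecomposition."""
    u_prev, u_curr = u, matvec(L_tilde, u)        # u_bar_0 = u, u_bar_1 = L_tilde u
    out = [theta[0] * a for a in u_prev]
    if len(theta) > 1:
        out = [o + theta[1] * b for o, b in zip(out, u_curr)]
    for k in range(2, len(theta)):
        # Recurrence: u_bar_k = 2 * L_tilde u_bar_{k-1} - u_bar_{k-2}
        u_next = [2 * a - b for a, b in zip(matvec(L_tilde, u_curr), u_prev)]
        out = [o + theta[k] * c for o, c in zip(out, u_next)]
        u_prev, u_curr = u_curr, u_next
    return out

def gcn_layer(A, H, Theta):
    """First-order layer: ReLU(D^-1/2 (A + I) D^-1/2 H Theta), D = degrees of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    AH = [[sum(A_hat[i][k] * H[k][c] for k in range(n)) for c in range(len(H[0]))]
          for i in range(n)]
    return [[max(0.0, sum(row[c] * Theta[c][d] for c in range(len(Theta))))
             for d in range(len(Theta[0]))] for row in AH]

# Path graph 0 - 1 - 2. It is bipartite, so the normalized Laplacian has
# lambda_max = 2 and the rescaled Laplacian is simply -D^-1/2 A D^-1/2.
s = 1.0 / math.sqrt(2.0)
L_tilde = [[0.0, -s, 0.0], [-s, 0.0, -s], [0.0, -s, 0.0]]
u = [1.0, 2.0, 3.0]
filtered = cheb_filter(L_tilde, u, theta=[0.0, 0.0, 1.0])   # isolates T_2(L_tilde) u

A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
hidden = gcn_layer(A, H=[[1.0], [2.0], [3.0]], Theta=[[1.0]])
print(filtered, hidden)
```

The Chebyshev filter touches the graph only through `matvec`, which is where the $O(KM)$ cost comes from once the matrix is stored sparsely; the first-order layer mixes each node only with its direct neighbors and itself.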

Some other spectral methods have also been proposed to solve the efficiency problem. For example, instead of using the Chebyshev expansion as in Eq. (10), CayleyNet [44] adopted Cayley polynomials to define graph convolutions:

#### multiple graphs aspect

A parallel series of works has focused on generalizing graph convolutions to multiple graphs of arbitrary sizes. Neural FPs [46] proposed a spatial method that also used the first-order neighbors:

Because the parameters $\Theta$ can be shared across different graphs and are independent of the graph size, Neural FPs can handle multiple graphs of arbitrary sizes. Note that Eq. (17) is very similar to Eq. (13). However, instead of considering the influence of node degree by adding a normalization term, Neural FPs proposed learning different parameters $\Theta$ for nodes with different degrees. This strategy performed well for small graphs such as molecular graphs (i.e., atoms as nodes and bonds as edges), but may not be scalable to larger graphs.

PATCHY-SAN [47] adopted a different idea. It assigned a unique node order using a graph labeling procedure such as the Weisfeiler-Lehman kernel [78] and then arranged node neighbors in a line using this pre-defined order. In addition, PATCHY-SAN defined a “receptive field” for each node $v_i$ by selecting a fixed number of nodes from its $k$-step neighborhood $\mathcal{N}_k(i)$. Then a standard 1-D CNN with proper normalization was adopted. Using this approach, nodes in different graphs all have a “receptive field” with a fixed size and order; thus, PATCHY-SAN can learn from multiple graphs just as normal CNNs learn from multiple images. The drawback is that the convolution depends heavily on the graph labeling procedure, which is a preprocessing step that is not learned. LGCN [48] further proposed to simplify the sorting process by using a lexicographical order (i.e., sorting neighbors based on their hidden representation in the final layer $H^L$). Instead of using a single order, the authors sorted different channels of $H^L$ separately. SortPooling [49] took a similar approach, but rather than sorting the neighbors of each node, the authors proposed to sort all the nodes (i.e., using a single order for all the neighborhoods). Despite the differences among these methods, enforcing a 1-D node order may not be a natural choice for graphs.

DCNN [50] adopted another approach by replacing the eigenbasis of the graph convolution with a diffusion basis, i.e., the neighborhoods of nodes were determined by the diffusion transition probability between nodes. Specifically, the convolution was defined as follows:

where $P^K = (P)^K$ is the transition probability of a length-$K$ diffusion process (i.e., random walks), $K$ is a preset diffusion length, and $\Theta^l$ are learnable parameters. Because only $P^K$ depends on the graph structure, the parameters $\Theta^l$ can be shared across graphs of arbitrary sizes. However, calculating $P^K$ has a time complexity of $O(N^2 K)$; thus, this method is not scalable to large graphs.
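As a small illustration of the diffusion basis, the sketch below builds the random-walk transition matrix and its powers for a toy graph and diffuses a one-hot feature over two steps. It assumes $P = D^{-1}A$ (row-normalized adjacency) and that no node is isolated; the helper names are hypothetical and learnable parameters are omitted.

```python
def matmul(X, Y):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transition_powers(A, K):
    """Return [P^1, ..., P^K] with P = D^-1 A (assumes every node has an edge)."""
    P = [[a / sum(row) for a in row] for row in A]
    powers = [P]
    for _ in range(K - 1):
        # Each power can have up to N^2 non-zero entries even when A is sparse,
        # which is why this construction does not scale to large graphs.
        powers.append(matmul(powers[-1], P))
    return powers

# Path graph 0 - 1 - 2 with a one-hot feature on node 0.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
powers = transition_powers(A, K=2)
H = [[1.0], [0.0], [0.0]]
diffused = matmul(powers[-1], H)   # P^2 H: two-step diffusion of the features
print(powers[-1], diffused)
```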

DGCN [51] was further proposed to jointly adopt the diffusion and the adjacency bases using a dual graph convolutional network. Specifically, DGCN used two convolutions: one was Eq. (14), and the other replaced the adjacency matrix with the positive pointwise mutual information (PPMI) matrix [79] of the transition probability as follows:

and $D_P(i, i) = \sum_j X_P(i, j)$ is the diagonal degree matrix of $X_P$. Then, these two convolutions were ensembled by minimizing the mean square differences between $H$ and $Z$. DGCN adopted a random walk sampling technique to accelerate the transition probability calculation. The experiments demonstrated that such dual convolutions were effective even for single-graph problems.
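For readers unfamiliar with PPMI, the following sketch computes it for a small non-negative matrix using the textbook definition: $\max(\log \frac{p(i,j)}{p(i)\,p(j)}, 0)$ with probabilities estimated from the matrix entries. It only illustrates the PPMI transform itself; how DGCN builds its input matrix from sampled random walks is not reproduced here, and the function name is hypothetical.

```python
import math

def ppmi(F):
    """Positive pointwise mutual information of a non-negative matrix F."""
    total = sum(sum(row) for row in F)
    row_sums = [sum(row) for row in F]
    col_sums = [sum(col) for col in zip(*F)]
    out = []
    for i, row in enumerate(F):
        out_row = []
        for j, f in enumerate(row):
            if f > 0:
                pmi = math.log(f * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(pmi, 0.0))
            else:
                out_row.append(0.0)   # log(0) is undefined; PPMI clips it to 0
        out.append(out_row)
    return out

# Two nodes that only ever transition to each other are maximally informative.
strong = ppmi([[0.0, 1.0], [1.0, 0.0]])
# A uniform matrix carries no information beyond the marginals.
flat = ppmi([[0.5, 0.5], [0.5, 0.5]])
print(strong, flat)
```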

#### framework

Based on the above two lines of work, MPNNs [52] were proposed as a unified framework for the graph convolution operation in the spatial domain using message-passing functions:

（21）

where $F^l(\cdot)$ and $G^l(\cdot)$ are the message functions and vertex update functions to be learned, respectively, and $\mathbf{m}^l$ denotes the “messages” passed between nodes. Conceptually, MPNNs are a framework in which each node sends messages based on its states and updates its states based on messages received from its immediate neighbors. The authors showed that the above framework includes many existing methods, such as GGS-NNs [24], Bruna et al. [40], Henaff et al. [41], Neural FPs [46], Kipf and Welling [43], and Kearnes et al. [55], as special cases. In addition, the authors proposed adding a “master” node that was connected to all the nodes to accelerate message passing across long distances, and they split the hidden representations into different “towers” to improve the generalization ability. The authors showed that a specific variant of MPNNs could achieve state-of-the-art performance in predicting molecular properties.
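The message-passing pattern described above can be sketched as a single generic step. The sketch assumes scalar node states, one scalar feature per directed edge, and summation as the way messages are combined; in an actual MPNN the two callables would be learned networks rather than the hand-written lambdas used here, and all names are hypothetical.

```python
def message_passing_step(neighbors, h, edge_feat, message_fn, update_fn):
    """One round: every node sums messages from its immediate neighbors, then updates."""
    new_h = {}
    for i, nbrs in neighbors.items():
        # Summation ignores the order in which neighbors are visited.
        m_i = sum(message_fn(h[i], h[j], edge_feat[(i, j)]) for j in nbrs)
        new_h[i] = update_fn(h[i], m_i)
    return new_h

# Toy path graph 0 - 1 - 2 with edge weights as edge features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
edge_feat = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}
h = {0: 1.0, 1: 2.0, 2: 3.0}

h_next = message_passing_step(
    neighbors, h, edge_feat,
    message_fn=lambda h_i, h_j, e_ij: e_ij * h_j,      # stand-in for a learned F
    update_fn=lambda h_i, m_i: 0.5 * h_i + 0.5 * m_i,  # stand-in for a learned G
)
print(h_next)
```

Swapping in different `message_fn` and `update_fn` choices is what lets the framework cover the special cases listed above.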

Concurrently, GraphSAGE [53] adopted an idea similar to Eq. (21), using multiple aggregating functions as follows:

（22）

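where $[\cdot, \cdot]$ is the concatenation operation and $\text{AGGREGATE}(\cdot)$ represents the aggregating function. The authors suggested three aggregating functions: the element-wise mean, an LSTM, and max-pooling, the last of which is defined as follows: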

（23）

where $\Theta_{\text{pool}}$ and $\mathbf{b}_{\text{pool}}$ are the parameters to be learned and $\max\{\cdot\}$ is the element-wise maximum. For the LSTM aggregating function, because an order of the neighbors is needed, the authors adopted a simple random order.
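The aggregate-then-concatenate pattern can be sketched as follows for the mean and max-pooling aggregators (the LSTM aggregator is omitted). This is an illustrative sketch rather than the GraphSAGE reference code: ReLU is assumed as the nonlinearity, the weights are fixed toy matrices instead of learned parameters, and the function names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, v, b=None):
    """Compute W v (+ b) for a list-of-lists matrix W."""
    out = [sum(w * x for w, x in zip(row, v)) for row in W]
    return out if b is None else [o + c for o, c in zip(out, b)]

def mean_aggregate(neigh):
    """Element-wise mean over the neighbor vectors."""
    return [sum(col) / len(col) for col in zip(*neigh)]

def max_pool_aggregate(neigh, W_pool, b_pool):
    """Element-wise max over ReLU(W_pool h_j + b_pool) for each neighbor j."""
    transformed = [relu(linear(W_pool, h_j, b_pool)) for h_j in neigh]
    return [max(col) for col in zip(*transformed)]

def sage_update(h_i, m_i, W):
    """New state from the concatenation [h_i, m_i] (list + is concatenation)."""
    return relu(linear(W, h_i + m_i))

h_i = [1.0, -1.0]
neigh = [[2.0, 0.0], [0.0, 4.0]]
W = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]       # toy 2 x 4 weight matrix
identity, zeros = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

m_mean = mean_aggregate(neigh)
m_pool = max_pool_aggregate(neigh, identity, zeros)
print(sage_update(h_i, m_mean, W), sage_update(h_i, m_pool, W))
```

Both aggregators return the same result for any ordering of `neigh`, which is why only the LSTM variant needs an explicit neighbor order.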

Mixture model network (MoNet) [54] also tried to unify the existing GCN models as well as CNNs for manifolds into a common framework using “template matching”:

where $\mathbf{u}(i, j)$ are the pseudo-coordinates of the node pair $(v_i, v_j)$, $F_k^l(\mathbf{u})$ is a parametric function to be learned, and $h_{ik}^l$ is the $k$-th dimension of $\mathbf{h}_i^l$. In other words, $F_k^l(\mathbf{u})$ served as a weighting kernel for combining neighborhoods. Then, MoNet adopted the following Gaussian kernel:

（25）

where $\boldsymbol{\mu}_k^l$ and $\Sigma_k^l$ are the mean vectors and diagonal covariance matrices to be learned, respectively. The pseudo-coordinates were degrees as in Kipf and Welling [43], i.e.,

Graph networks (GNs) [9] proposed a more general framework for both GCNs and GNNs that learned three sets of representations: $\mathbf{h}_i^l$, $\mathbf{e}_{ij}^l$, and $\mathbf{z}^l$ as the representations for nodes, edges, and the entire graph, respectively. These representations were learned using three aggregation and three updating functions:

where $F^V(\cdot)$, $F^E(\cdot)$, and $F^G(\cdot)$ are the corresponding updating functions for nodes, edges, and the entire graph, respectively, and $G(\cdot)$ represents message-passing functions whose superscripts denote message-passing directions. Note that the message-passing functions all take a set as the input; thus, their arguments are variable in length and these functions should be invariant to input permutations. Some examples include the element-wise summation, mean, and maximum. Compared with MPNNs, GNs introduced the edge representations and the representation of the entire graph, thus making the framework more general. In summary, the convolution operations have evolved from the spectral domain to the spatial domain and from multistep neighbors to the immediate neighbors. Currently, gathering information from the immediate neighbors (as in Eq. (14)) and following the framework of Eqs. (21), (22), and (27) are the most common choices for graph convolution operations.

### Readout

使用图卷积运算，可以学习有用的节点feature来解决许多以节点为中心的任务。 但是，为了处理以图形为中心的任务，需要汇总节点信息以形成图形级表示。 在文献中，这种过程通常称为读出操作。 基于常规和本地邻域，标准CNN会进行多次跨步卷积或合并以逐渐降低分辨率。 由于图形缺乏网格结构，因此无法直接使用这些现有方法。

Using graph convolution operations, useful node features can be learned to solve many node-focused tasks. However, to tackle graph-focused tasks, node information needs to be aggregated to form a graph-level representation. In the literature, such procedures are usually called readout operations. Based on a regular and local neighborhood, standard CNNs conduct multiple strided convolutions or pooling operations to gradually reduce the resolution. Since graphs lack a grid structure, these existing methods cannot be used directly.

顺序不变。 图读取操作的关键要求是该操作应不依赖于节点顺序，即，如果我们使用两个节点集之间的双射函数来更改节点和边的索引，则整个图的表示不应更改。 例如，一种药物是否可以治疗某些疾病取决于其固有结构。 因此，如果我们使用不同的节点索引表示药物，我们应该得到相同的结果。 注意，因为这个问题与图同构问题有关，其中最著名的算法是拟多项式[80]，所以我们只能找到一个在多项式时间内阶不变的函数，反之亦然，即两个结构不同的函数 图可能具有相同的表示形式。

Order invariance. A critical requirement for graph readout operations is that the operation should be invariant to the node order, i.e., if we change the indices of nodes and edges using a bijective function between two node sets, the representation of the entire graph should not change. For example, whether a drug can treat certain diseases depends on its inherent structure; thus, we should get identical results if we represent the drug using different node indices. Note that this problem is related to the graph isomorphism problem, for which the best-known algorithm is quasipolynomial [80]. Consequently, in polynomial time we can only find a function that is order-invariant but not the converse, i.e., even two structurally different graphs may have the same representation.

#### Statistics

最基本的顺序不变的操作涉及简单统计，例如求和，求平均或最大 - 池[46]，[50]，即

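The most basic order-invariant operations involve simple statistics such as summation, averaging, or max-pooling [46], [50], i.e.,

$$h_G = \sum_{i=1}^{N} h_i^L \quad \text{or} \quad h_G = \frac{1}{N}\sum_{i=1}^{N} h_i^L \quad \text{or} \quad h_G = \max\left\{h_i^L, \forall i\right\} \tag{28}$$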

其中hG是图G的表示，hL i是最后一层L中节点vi的表示。但是，此类第一时刻统计信息可能不足以代表不同的图。

where $h_G$ is the representation of the graph $G$ and $h_i^L$ is the representation of node $v_i$ in the final layer $L$. However, such first-moment statistics may not be sufficiently representative to distinguish different graphs.
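As a minimal sketch of these statistics (plain Python; the helper name `readout` and the toy values are illustrative assumptions, not taken from the survey), the snippet below aggregates final-layer node representations by sum, mean, or max and checks that relabeling the nodes leaves the result unchanged. It also illustrates the limitation noted above: two different sets of node representations can share the same sum and mean.

```python
import random

def readout(H, mode="sum"):
    """Aggregate node representations H (list of equal-length rows) into one graph vector."""
    columns = list(zip(*H))  # one tuple per feature dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")

# Toy final-layer representations for a 4-node graph (hypothetical values).
H_L = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0], [2.5, 1.0]]

# Relabeling the nodes only permutes the rows of H_L.
H_perm = H_L[:]
random.Random(0).shuffle(H_perm)

for mode in ("sum", "mean", "max"):
    print(mode, readout(H_L, mode), readout(H_perm, mode))

# Two different "graphs" that sum and mean cannot tell apart.
G_a = [[1.0], [3.0]]
G_b = [[2.0], [2.0]]
print(readout(G_a, "sum") == readout(G_b, "sum"))  # same first moment
print(readout(G_a, "max") == readout(G_b, "max"))  # max happens to differ here
```

The permutation check relies on the toy values being exactly representable in floating point; with arbitrary floats, a sum can differ in the last bits depending on the order of addition.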

Kearnes等。 [55]建议使用模糊直方图考虑节点表示的分布[81]。 模糊直方图背后的基本思想是构造几个“直方图bin”，然后计算hL i到这些bin的隶属度，即通过将节点表示视为样本并将它们匹配到一些预定义的模板，最后返回的串联。 最终的直方图。 以此方式，可以区分具有相同的总和/平均/最大但具有不同分布的节点。

Kearnes et al. [55] suggested considering the distribution of node representations by using fuzzy histograms [81]. The basic idea behind fuzzy histograms is to construct several “histogram bins” and then calculate the memberships of $h_i^L$ to these bins, i.e., to regard node representations as samples, match them to some pre-defined templates, and finally return the concatenation of the final histograms. In this way, graphs whose node representations have the same sum/average/maximum but different distributions can be distinguished.

聚合节点表示的另一种常用方法是添加一个完全连接的（FC）层作为最终层[40]，即

Another commonly used approach for aggregating node representations is to add a fully connected (FC) layer as the final layer [40], i.e.,

$$h_G = \rho\left(\left[H^L\right]\Theta_{FC}\right) \tag{29}$$

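where $\left[H^L\right] \in \mathbb{R}^{N f_L}$ is the concatenation of the final node representations $H^L$, $\Theta_{FC} \in \mathbb{R}^{N f_L \times f_{\text{output}}}$ are parameters, and $f_{\text{output}}$ is the dimensionality of the output. Eq. (29) can be regarded as a weighted sum of node-level features. One advantage is that the model can learn different weights for different nodes; however, this ability comes at the cost of being unable to guarantee order invariance.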

#### Hierarchical Clustering  

众所周知，图显示出丰富的层次结构，而不是节点和图的层次结构二分法[82]，可以通过图4所示的层次聚类方法进行探索。例如，基于密度的聚类聚类[83] 用于Bruna等[40]和多分辨率光谱聚类[84]用于Henaff等[41]。 ChebNet [42]和MoNet [54]采用了另一种贪婪的层次聚类算法Graclus [85]来一次合并两个节点，并采用快速池化方法将节点重新排列为平衡的二叉树。 ECC [63]通过执行特征分解[86]采用了另一种层次聚类方法。 但是，这些分层聚类方法都与图卷积无关（即，它们可以作为预处理步骤执行，并且不能以端到端的方式进行训练）。

Rather than a dichotomy between node and graph level structures, graphs are known to exhibit rich hierarchical structures [82], which can be explored by hierarchical clustering methods as shown in Figure 4. For example, a **density-based agglomerative** clustering [83] was used in Bruna et al. [40] and **multi-resolution spectral clustering** [84] was used in Henaff et al. [41]. ChebNet [42] and MoNet [54] adopted another greedy hierarchical clustering algorithm, Graclus [85], to merge two nodes at a time, along with a fast pooling method to rearrange the nodes into a balanced binary tree. ECC [63] adopted another hierarchical clustering method by performing eigendecomposition [86]. However, these hierarchical clustering methods are all independent of the graph convolutions (i.e., they can be performed as a preprocessing step and are not trained in an end-to-end fashion). 

为了解决这个问题，DiffPool [56]提出了一种与图卷积联合训练的可微层次聚类算法。 具体来说，作者建议使用隐藏表示法在每一层中学习软集群分配矩阵，如下所示：

To solve that problem, DiffPool [56] proposed a differentiable hierarchical clustering algorithm jointly trained with the graph convolutions. Specifically, the authors proposed learning a soft cluster assignment matrix in each layer using the hidden representations as follows:  

$$S^l = \mathcal{F}\left(A^l, H^l\right) \tag{30}$$

其中Sl 2 RNl×Nl + 1是簇分配矩阵，N1是层l中的簇数，而F（·）是要学习的函数。 然后，可以通过根据Sl取平均值来获得此“粗化”图的节点表示和新的邻接矩阵：

where $S^l \in \mathbb{R}^{N_l \times N_{l+1}}$ is the cluster assignment matrix, $N_l$ is the number of clusters in layer $l$, and $\mathcal{F}(\cdot)$ is a function to be learned. Then, the node representations and the new adjacency matrix for this “coarsened” graph can be obtained by taking the average according to $S^l$ as follows:

$$H^{l+1} = \left(S^l\right)^T \hat{H}^{l+1}, \qquad A^{l+1} = \left(S^l\right)^T A^l S^l \tag{31}$$

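where $\hat{H}^{l+1}$ is obtained by applying a graph convolution layer to $H^l$, i.e., the graph is coarsened from $N_l$ nodes to $N_{l+1}$ nodes in each layer after the convolution operation. The initial number of nodes is $N_0$ and the last layer has $N_L = 1$, i.e., a single node that represents the entire graph. Because the cluster assignment operation is soft, the connections between clusters are not sparse; thus, the time complexity of the method is $O(N^2)$.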
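The coarsening step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not DiffPool's implementation: the assignment logits are passed in directly (in DiffPool they would come from a learned function of the adjacency matrix and hidden representations), the soft assignment is a row-wise softmax, and the helper names are hypothetical.

```python
import math

def matmul(X, Y):
    """Dense matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def softmax_rows(X):
    """Row-wise softmax, so each node's cluster memberships sum to 1."""
    out = []
    for row in X:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def coarsen(A, H_hat, logits):
    """Pool N_l nodes into N_{l+1} soft clusters: S^T H_hat and S^T A S."""
    S = softmax_rows(logits)            # shape N_l x N_{l+1}
    St = transpose(S)
    H_next = matmul(St, H_hat)          # cluster-level representations
    A_next = matmul(matmul(St, A), S)   # cluster-level (dense) adjacency
    return S, H_next, A_next

# Toy 4-node path graph pooled into 2 clusters (hypothetical values).
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
H_hat = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
logits = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]

S, H_next, A_next = coarsen(A, H_hat, logits)
print(A_next)  # every entry is nonzero: soft assignments densify the graph
```

Because every row of the assignment matrix sums to one, the total edge weight is preserved by the pooling, while the coarsened adjacency matrix becomes dense, which is the source of the quadratic cost mentioned above.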

#### Imposing Orders and Others

如第4.1.3节所述，PATCHY-SAN [47]和SortPooling [49]提出了施加节点顺序的想法，然后像在CNN中一样诉诸于标准的1-D池化。 这些方法是否可以保留顺序不变性取决于顺序是如何施加的，这是另一个研究领域，我们请读者参考[87]进行调查。 但是，是否强加节点顺序是图形的自然选择，如果是，则构成最佳节点顺序的最佳方法仍在研究中。

As mentioned in Section 4.1.3, PATCHY-SAN [47] and SortPooling [49] took the idea of imposing a node order and then resorted to standard 1-D pooling as in CNNs. Whether these methods can preserve order invariance depends on how the order is imposed, which is another research field; we refer readers to [87] for a survey. However, whether imposing a node order is a natural choice for graphs and, if so, what the best node orders are remain open research topics.

除上述方法外，还有一些启发式方法。 在GNN中[23]，作者建议添加一个连接到所有节点的特殊节点以表示整个图。 类似地，GNs [9]提出通过从所有节点和边缘接收消息来直接学习整个图的表示。

In addition to the aforementioned methods, there are some heuristics. In GNNs [23], the authors suggested adding a special node connected to all nodes to represent the entire graph. Similarly, GNs [9] proposed to directly learn the representation of the entire graph by receiving messages from all nodes and edges. 

MPNN采用set2set [88]，这是对seq2seq模型的修改。 具体来说，set2set使用“读取-处理-写入”模型，该模型同时接收所有输入，使用注意机制和LSTM计算内部存储器，然后写入输出。 与seq2seq是顺序敏感的不同，set2set对于输入顺序是不变的。

MPNNs adopted set2set [88], a modification of the seq2seq model. Specifically, set2set uses a “Read-Process-and-Write” model that receives all inputs simultaneously, computes internal memories using an attention mechanism and an LSTM, and then writes the outputs. Unlike seq2seq which is order-sensitive, set2set is invariant to the input order. 

#### Summary

简而言之，诸如平均或求和之类的统计信息是最简单的读出操作，而通过图卷积联合训练的层次聚类算法更先进，但也更加复杂。 还研究了其他方法，例如添加伪节点或强加节点顺序。

In short, statistics such as averaging or summation are the simplest readout operations, while hierarchical clustering algorithms jointly trained with graph convolutions are more advanced but also more complex. Other methods such as adding a pseudo node or imposing a node order have also been investigated.

### Improvement and Discussion

已经引入了许多技术来进一步改善GCN。 请注意，其中一些方法是通用的，也可以应用于图上的其他深度学习模型。

Many techniques have been introduced to further improve GCNs. Note that some of these methods are general and could be applied to other deep learning models on graphs as well. 

#### Attention Mechanism

在上述GCN中，节点邻域以相等或预定义的权重进行聚合。 但是，邻居的影响可能相差很大。 因此，应该在训练过程中学习它们，而不是预先确定。 受注意力机制[89]的启发，图注意力网络（GAT）[57]通过修改等式13中的卷积运算将注意力机制引入了GCN如下：

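In the aforementioned GCNs, the node neighborhoods are aggregated with equal or pre-defined weights. However, the influences of neighbors can vary greatly; thus, they should be learned during training rather than being predetermined. Inspired by the attention mechanism [89], the graph attention network (GAT) [57] introduced the attention mechanism into GCNs by modifying the convolution operation in Eq. (13) as follows:

$$h_i^{l+1} = \rho\left(\sum_{j \in \hat{\mathcal{N}}(i)} \alpha_{ij}^l h_j^l \Theta^l\right) \tag{32}$$

where $\alpha_{ij}^l$ is node $v_i$'s attention to node $v_j$ in the $l$th layer:

$$\alpha_{ij}^l = \frac{\exp\left(\text{LeakyReLU}\left(\mathcal{F}\left(h_i^l \Theta^l, h_j^l \Theta^l\right)\right)\right)}{\sum_{k \in \hat{\mathcal{N}}(i)} \exp\left(\text{LeakyReLU}\left(\mathcal{F}\left(h_i^l \Theta^l, h_k^l \Theta^l\right)\right)\right)} \tag{33}$$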

其中F（·;·）是另一个需要学习的函数，例如多层感知器（MLP）。 为了提高模型的能力和稳定性，作者还建议使用多个独立attention并将其结果进行合并，即，如图5所示的多头注意力机制[89]。GaAN[58]进一步建议针对不同的头和头学习不同的权重。 将这种方法应用于交通预测问题。

where F(·; ·) is another function to be learned such as a multilayer perceptron (MLP). To improve model capacity and stability, the authors also suggested using multiple independent attentions and concatenating the results, i.e., the multi-head attention mechanism [89] as illustrated in Figure 5. GaAN [58] further proposed learning different weights for different heads and applied such a method to the traffic forecasting problem. 
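To make the attention computation concrete, here is a small single-head sketch in plain Python. Several choices are assumptions made for illustration: the scoring function is a single vector `a` applied to the concatenation of the two transformed representations (one common choice for the learnable function), each neighbor list is assumed to already contain the node itself, and the output nonlinearity is omitted.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_head(H, neighbors, Theta, a):
    """Return (alpha, H_next) for one attention head.

    H: N x f_in node features, Theta: f_in x f_out weights,
    a: scoring vector of length 2 * f_out, neighbors: dict node -> list of nodes.
    """
    # Linear transform z_i = h_i Theta.
    Z = [[sum(h * w for h, w in zip(row, col)) for col in zip(*Theta)] for row in H]
    alpha, H_next = {}, []
    for i in range(len(H)):
        nbrs = neighbors[i]
        # Unnormalized score for each neighbor: LeakyReLU(a . [z_i || z_j]).
        scores = [leaky_relu(sum(x * y for x, y in zip(a, Z[i] + Z[j]))) for j in nbrs]
        # Softmax over the neighborhood (shifted by the max for stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
        # Attention-weighted sum of transformed neighbor features.
        H_next.append([sum(alpha[i][j] * Z[j][k] for j in nbrs) for k in range(len(Z[i]))])
    return alpha, H_next

# Toy 3-node path graph with self-loops in the neighbor lists (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
Theta = [[1.0, 0.0], [0.0, 1.0]]

alpha_uniform, H_uniform = attention_head(H, neighbors, Theta, a=[0.0, 0.0, 0.0, 0.0])
alpha_learned, H_learned = attention_head(H, neighbors, Theta, a=[1.0, -1.0, 0.5, 2.0])
print(alpha_uniform[1])
print(alpha_learned[1])
```

With a zero scoring vector, the weights reduce to a uniform average over the neighborhood, which recovers the equal-weight aggregation that attention is meant to generalize. A multi-head variant would run several such heads with separate parameters and concatenate their outputs.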

HAN [59]提出了一种针对异构图的两级注意机制，即节点级和语义级注意机制。 具体来说，节点级别的关注机制类似于GAT，但也考虑了节点类型。 因此，它可以分配不同的权重来聚合基于元路径的邻居。 然后，语义层面的注意力了解了不同元路径的重要性，并输出了最终结果。

HAN [59] proposed a two-level attention mechanism, i.e., a node-level and a semantic-level attention mechanism, for heterogeneous graphs. Specifically, the node-level attention mechanism was similar to a GAT, but also considered node types; therefore, it could assign different weights when aggregating meta-path-based neighbors. The semantic-level attention then learned the importance of different meta-paths and output the final results.

#### Residual and Jumping Connections

许多研究已经观察到，现有GCN的最合适深度通常非常有限，例如2或3层。 这个问题可能是由于训练深层GCN时遇到的实际困难或过度平滑的问题，即，较深层中的所有节点都具有相同的表示形式[62]，[70]。 为了解决这个问题，可以将类似于ResNet [90]的剩余连接添加到GCN。 例如，Kipf和Welling [43]在方程式中增加了残余连接。 （14）如下：

Many studies have observed that the most suitable depth for the existing GCNs is often very limited, e.g., 2 or 3 layers. This problem is potentially due to the practical difficulties involved in training deep GCNs or the over-smoothing problem, i.e., all nodes in deeper layers have the same representation [62], [70]. To remedy this problem, residual connections similar to ResNet [90] can be added to GCNs. For example, Kipf and Welling [43] added residual connections into Eq. (14) as follows:

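$$H^{l+1} = \rho\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^l \Theta^l\right) + H^l \tag{34}$$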

他们通过实验表明，添加此类残余连接可以使网络深度增加，这与ResNet的结果类似。 列网络（CLN）[60]通过使用以下具有可学习权重的剩余连接采用了类似的思想：

They showed experimentally that adding such residual connections could allow the depth of the network to increase, which is similar to the results of ResNet. Column network (CLN) [60] adopted a similar idea by using the following residual connections with learnable weights: 

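$$h_i^{l+1} = \alpha_i^l \odot \tilde{h}_i^{l+1} + \left(1 - \alpha_i^l\right) \odot h_i^l \tag{35}$$

where $\tilde{h}_i^{l+1}$ is the newly aggregated representation, computed as in Eq. (14), and $\alpha_i^l$ is a set of learnable gating weights.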

请注意等式（35）与GGS-NN中的GRU非常相似[24]。 区别在于，在CLN中，上标表示层数，不同的层包含不同的参数，而在GGSNN中，上标表示伪时间，并且跨时间步使用一组参数。 受到个性化PageRank的启发，PPNP [61]定义了通过卷积传送到初始层的图卷积：

Note that Eq. (35) is very similar to the GRU as in GGS-NNs [24]. The differences are that in a CLN, the superscripts denote the number of layers, and different layers contain different parameters, while in GGS-NNs, the superscript denotes the pseudo time and a single set of parameters is used across time steps. Inspired by personalized PageRank, PPNP [61] defined graph convolutions with teleportation to the initial layer:

$$H^{l+1} = (1 - \alpha)\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^l + \alpha H^0 \tag{36}$$

where $H^0 = \mathcal{F}_\theta\left(F^V\right)$ and $\alpha$ is a hyper-parameter.
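The teleportation idea can be illustrated with a short propagation loop. The sketch below is an assumption-laden illustration rather than the PPNP implementation: it uses the symmetrically normalized adjacency matrix with self-loops, treats the initial representations `H0` as given (in practice they would be produced by a learned network), and the function names are hypothetical.

```python
import math

def normalized_adjacency(A):
    """Symmetric normalization of A + I, i.e. D^{-1/2} (A + I) D^{-1/2}."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def propagate(A, H0, alpha=0.1, steps=10):
    """Repeat H <- (1 - alpha) * A_hat H + alpha * H0, starting from H0."""
    A_hat = normalized_adjacency(A)
    n, f = len(H0), len(H0[0])
    H = [row[:] for row in H0]
    for _ in range(steps):
        AH = [[sum(A_hat[i][k] * H[k][c] for k in range(n)) for c in range(f)] for i in range(n)]
        H = [[(1 - alpha) * AH[i][c] + alpha * H0[i][c] for c in range(f)] for i in range(n)]
    return H

# Toy 3-node path graph with hypothetical initial representations.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

H_short = propagate(A, H0, alpha=0.5, steps=60)
H_long = propagate(A, H0, alpha=0.5, steps=61)
print(H_short)
```

Because a fraction `alpha` of every update comes from the initial representations, the iteration converges to a fixed point instead of collapsing all nodes to the same vector, which is why this construction tolerates many propagation steps.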

跳跃式知识网络（JK-Nets）[62]提出了另一种体系结构，以将网络的最后一层与所有较低的隐藏层相连，即，通过将所有表示“跳跃”到最终输出，如图6所示。 这样，模型可以学习选择性地利用来自不同层的信息。 JK-Nets的正式表述如下：

Jumping knowledge networks (JK-Nets) [62] proposed another architecture to connect the last layer of the network with all the lower hidden layers, i.e., by “jumping” all the representations to the final output, as illustrated in Figure 6. In this way, the model can learn to selectively exploit information from different layers. Formally, JK-Nets were formulated as follows:

$$h_i^{\text{final}} = \text{AGGREGATE}\left(h_i^0, h_i^1, \ldots, h_i^L\right) \tag{37}$$

其中hfinal i是节点vi的最终表示形式，AGGREGATE（·）是聚合函数，L是隐藏层数。 JK-Nets使用了三个类似于GraphSAGE [53]的聚合函数：串联，最大池化和LSTM注意。 实验结果表明，添加跳转连接可以提高多个GCN的性能。

where $h_i^{\text{final}}$ is the final representation for node $v_i$, AGGREGATE(·) is the aggregating function, and $L$ is the number of hidden layers. JK-Nets used three aggregating functions similar to GraphSAGE [53]: concatenation, max-pooling, and the LSTM attention. The experimental results showed that adding jump connections could improve the performance of multiple GCNs.

#### Edge Feature

前面提到的GCN大多集中在利用节点特征和图结构上。 在本小节中，我们简要讨论如何使用另一个重要的信息源：边缘特征。

The aforementioned GCNs mostly focus on utilizing node features and graph structures. In this subsection, we briefly discuss how to use another important source of information: the edge features. 

对于具有离散值（例如边缘类型）的简单边缘特征，一种直接的方法是为不同的边缘类型训练不同的参数并汇总结果。 例如，神经FP [46]为不同程度的节点训练了不同的参数，这对应于分子图中键类型的隐式边缘特征，然后对结果求和。 CLN [60]在异构图中为不同的边缘类型训练了不同的参数，并对结果取平均值。 边缘条件卷积（ECC）[63]还基于边缘类型训练了不同的参数，并将其应用于图形分类。 关系GCN（R-GCN）[64]通过为不同的关系类型训练不同的权重，对知识图采用了类似的想法。 但是，这些方法仅适用于有限数量的离散边缘特征。

For simple edge features with discrete values such as the edge type, a straightforward method is to train different parameters for different edge types and aggregate the results. For example, Neural FPs [46] trained different parameters for nodes with different degrees, which corresponds to the implicit edge feature of bond types in a molecular graph, and then summed over the results. CLN [60] trained different parameters for different edge types in a heterogeneous graph and averaged the results. Edge-conditioned convolution (ECC) [63] also trained different parameters based on edge types and applied them to graph classification. Relational GCNs (R-GCNs) [64] adopted a similar idea for knowledge graphs by training different weights for different relation types. However, these methods are suitable only for a limited number of discrete edge features. 

DCNN [50]提出了另一种将每个边转换为连接到该边的头尾节点的节点的方法。 进行此转换后，可以将边缘要素视为节点要素

DCNN [50] proposed another method to convert each edge into a node connected to the head and tail node of that edge. After this conversion, edge features can be treated as node features.

LGCN [65]构造了一个折线图B 2 R2M×2M，以合并边缘特征，如下所示：

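LGCN [65] constructed a line graph $B \in \mathbb{R}^{2M \times 2M}$ to incorporate edge features as follows:

$$B_{i \to j,\, i' \to j'} = \begin{cases} 1 & \text{if } j = i' \text{ and } j' \neq i, \\ 0 & \text{otherwise.} \end{cases} \tag{38}$$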

换句话说，线图中的节点是原始图中的有向边，如果信息可以流过它们在原始图中的相应边，则折线图中的两个节点是连接的。 然后，LGCN采用了两个GCN：一个位于原始图上，另一个位于线图上。

In other words, nodes in the line graph are directed edges in the original graph, and two nodes in the line graph are connected if information can flow through their corresponding edges in the original graph. Then, LGCN adopted two GCNs: one on the original graph and one on the line graph. 
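A line graph of this kind is easy to build explicitly. The sketch below assumes the non-backtracking rule, i.e., directed edge (i, j) connects to directed edge (i', j') when j = i' and j' ≠ i, so that information flows forward along a path without immediately returning. The function name and the toy graph are illustrative.

```python
def line_graph(edges):
    """Build the directed-edge line graph of an undirected edge list.

    Returns the list of directed edges (the line-graph nodes) and a
    2M x 2M 0/1 matrix B over them.
    """
    directed = []
    for u, v in edges:
        directed += [(u, v), (v, u)]  # each undirected edge yields two directions
    index = {e: k for k, e in enumerate(directed)}
    B = [[0] * len(directed) for _ in directed]
    for (i, j) in directed:
        for (i2, j2) in directed:
            # Connect i->j to i'->j' if the first edge ends where the second
            # starts (j == i') and the second does not go straight back (j' != i).
            if j == i2 and j2 != i:
                B[index[(i, j)]][index[(i2, j2)]] = 1
    return directed, B

# Path graph 0 - 1 - 2 with M = 2 undirected edges.
directed, B = line_graph([(0, 1), (1, 2)])
for edge, row in zip(directed, B):
    print(edge, row)
```

On the three-node path, only two connections exist: 0→1 feeds 1→2, and 2→1 feeds 1→0. The double loop is quadratic in the number of edges and is meant for clarity, not for large graphs.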

Kearnes等。 [55]提出了一种使用“编织模块”的架构。 具体来说，他们学习了节点和边缘的表示形式，并使用四个不同的功能在每个编织模块中交换了它们之间的信息：节点到节点（NN），节点到边缘（NE），边缘到边缘（EE）和边缘 到节点（EN）：

Kearnes et al. [55] proposed an architecture using a “weave module”. Specifically, they learned representations for both nodes and edges and exchanged information between them in each weave module using four different functions: node-to-node (NN), node-to-edge (NE), edge-to-edge (EE), and edge-to-node (EN):

其中el ij是第l层中边缘（vi; vj）的表示，而F（·）是可学习的函数，其下标表示消息传递方向。 通过堆叠多个这样的模块，信息可以通过在节点和边缘表示之间交替传递来传播。 注意，在节点到节点和边缘到边缘功能中，隐式添加了与JK-Nets [62]中相似的跳转连接。 GNs [9]还提出了学习边缘表示并使用消息传递功能更新节点和边缘表示的方法，如等式4所示。 （27）在第4.1.4节中。 在这方面，“编织模块”是GN的特殊情况，并不代表整个图形。

where $e_{ij}^l$ is the representation of edge $(v_i, v_j)$ in the $l$th layer and $\mathcal{F}(\cdot)$ are learnable functions whose subscripts represent message-passing directions. By stacking multiple such modules, information can propagate by alternately passing between node and edge representations. Note that in the node-to-node and edge-to-edge functions, jump connections similar to those in JK-Nets [62] are implicitly added. GNs [9] also proposed learning an edge representation and updating both node and edge representations using message-passing functions as shown in Eq. (27) in Section 4.1.4. In this aspect, the “weave module” is a special case of GNs that does not learn a representation of the entire graph.

#### Sampling

为大型图形训练GCN时，关键的瓶颈之一是效率。 如第4.1.4节所示，许多GCN遵循邻域聚合方案。 但是，由于许多实图遵循幂律分布[91]（即，几个节点的度数非常大），因此邻居数可以非常快速地扩展。 为了解决这个问题，已经提出了两种类型的采样方法：邻域采样和逐层采样，如图7所示。

One critical bottleneck when training GCNs for large-scale graphs is efficiency. As shown in Section 4.1.4, many GCNs follow a neighborhood aggregation scheme. However, because many real graphs follow a power-law distribution [91] (i.e., a few nodes have very large degrees), the number of neighbors can expand extremely quickly. To deal with this problem, two types of sampling methods have been proposed: neighborhood sampling and layer-wise sampling, as illustrated in Figure 7.

在邻域采样中，在计算过程中对每个节点执行采样。 GraphSAGE [53]在训练期间为每个节点统一采样了固定数量的邻居。 PinSage [66]提出了使用图上的随机游走对邻居进行采样以及一些实现方面的改进，包括CPU和GPU之间的协调，映射减少推理管道等等。 PinSage被证明能够处理真实的十亿比例的图形。 StochasticGCN [67]进一步建议通过使用最后一批的历史激活作为控制变量来减少采样方差，从而在理论上保证任意小样本量。

In neighborhood samplings, the sampling is performed for each node during the calculations. GraphSAGE [53] uniformly sampled a fixed number of neighbors for each node during training. PinSage [66] proposed sampling neighbors using random walks on graphs along with several implementation improvements including coordination between the CPU and GPU, a map-reduce inference pipeline, and so on. PinSage was shown to be capable of handling a real billion-scale graph. StochasticGCN [67] further proposed reducing the sampling variances by using the historical activations of the last batches as a control variate, allowing for arbitrarily small sample sizes with a theoretical guarantee. 
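The effect of uniform neighborhood sampling on the size of the computation graph can be sketched as follows. This is a simplified illustration, not GraphSAGE's code: it samples without replacement and keeps all neighbors when a node has fewer than the requested number, and the names `sample_neighbors`, `sample_computation_graph`, and `fanouts` are hypothetical.

```python
import random

def sample_neighbors(adj, node, k, rng):
    """Uniformly sample at most k neighbors of `node` (without replacement)."""
    nbrs = adj[node]
    if len(nbrs) <= k:
        return list(nbrs)
    return rng.sample(nbrs, k)

def sample_computation_graph(adj, batch, fanouts, rng):
    """Return the nodes needed at each depth: layers[0] is the batch itself."""
    layers = [list(batch)]
    for k in fanouts:
        frontier = set()
        for v in layers[-1]:
            frontier.update(sample_neighbors(adj, v, k, rng))
        layers.append(sorted(frontier))
    return layers

# Star graph: hub 0 has 100 neighbors, mimicking a high-degree node.
adj = {0: list(range(1, 101))}
adj.update({i: [0] for i in range(1, 101)})

layers = sample_computation_graph(adj, batch=[0], fanouts=[5, 5], rng=random.Random(0))
print([len(layer) for layer in layers])
```

Without sampling, a single hub in the batch would pull in all 100 neighbors; with a fan-out of 5, each layer's size is bounded by the fan-out times the size of the previous layer, independent of node degrees.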

FastGCN [68]并未对节点的邻居进行采样，而是采用了不同的策略：它通过将节点解释为i.d.来对每个卷积层中的节点进行采样（即逐层采样）。 样本和图卷积作为概率测度下的积分变换。 FastGCN还显示，通过节点的归一化程度可以减少方差并提高性能。 使[69]进一步建议的采样节点位于以顶层为条件的较低层； 这种方法更具适应性，适用于显着减少方差。

Instead of sampling neighbors of nodes, FastGCN [68] adopted a different strategy: it sampled nodes in each convolutional layer (i.e., a layer-wise sampling) by interpreting the nodes as i.i.d. samples and the graph convolutions as integral transforms under probability measures. FastGCN also showed that sampling nodes via their normalized degrees could reduce variances and lead to better performance. Adapt [69] further proposed sampling nodes in the lower layers conditioned on their top layer; this approach was more adaptive and could explicitly reduce variances.

图7.不同的节点采样方法，其中蓝色节点表示一批样品，箭头表示采样
指示。 （B）中的红色节点代表历史样本。

Fig. 7. Different node sampling methods, in which the blue nodes indicate samples from one batch and the arrows indicate the sampling directions. The red nodes in (B) represent historical samples.

#### Inductive setting

GCN的另一个重要方面是它们是否可以应用于归纳设置，即在一组节点或图上进行训练，并在另一组看不见的节点或图上进行测试。 原则上，该目标是通过在不依赖于图的基础上学习给定特征上的映射函数来实现的，并且可以跨节点或图进行传递。 归纳设置在GraphSAGE [53]，GAT [57]，GaAN [58]和FastGCN [68]中得到了验证。 但是，现有的归纳GCN仅适用于具有显式特征的图。 如何在没有显着特征的情况下对图进行归纳学习，通常被称为样本外问题[92]，目前在文献中仍处于开放状态。

Another important aspect of GCNs is whether they can be applied to an inductive setting, i.e., training on a set of nodes or graphs and testing on another unseen set of nodes or graphs. In principle, this goal is achieved by learning a mapping function on the given features that does not depend on the graph basis and can be transferred across nodes or graphs. The inductive setting was verified in GraphSAGE [53], GAT [57], GaAN [58], and FastGCN [68]. However, the existing inductive GCNs are suitable only for graphs with explicit features. How to conduct inductive learning for graphs without explicit features, usually called the out-of-sample problem [92], remains largely open in the literature.

#### Theoretical Analysis  

为了了解GCN的有效性，已提出了一些理论分析，这些理论分析可分为三类：**以节点为中心的任务，以图为中心的任务和常规分析。**

To understand the effectiveness of GCNs, some theoretical analyses have been proposed that can be divided into three categories: **node-focused tasks, graph-focused tasks, and general analysis.**

对于以节点为中心的任务，Li等人。 [70]首先通过使用一种特殊的拉普拉斯平滑法来分析GCN的性能，这使同一聚类中节点的特征相似。 原始拉普拉斯平滑操作的公式如下：

For node-focused tasks, Li et al. [70] first analyzed the performance of GCNs by using a special form of Laplacian smoothing, which makes the features of nodes in the same cluster similar. The original Laplacian smoothing operation is formulated as follows:

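$$h_i' = (1 - \gamma) h_i + \gamma \sum_{j \in \mathcal{N}(i)} \frac{1}{d_i} h_j, \qquad 0 \le \gamma \le 1 \tag{41}$$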

其中hi和h0 i分别是节点vi的原始特征和平滑特征。 我们可以看到 （41）非常类似于等式13中的图卷积。  基于这一见解，李等人。 还提出了GCN的联合训练和自我训练方法

where $h_i$ and $h_i'$ are the original and smoothed features of node $v_i$, respectively. We can see that Eq. (41) is very similar to the graph convolution in Eq. (13). Based on this insight, Li et al. also proposed a co-training and a self-training method for GCNs.
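The link between Laplacian smoothing and over-smoothing can be seen in a tiny experiment. The sketch assumes the common form of the operation, in which each node's new feature is a convex combination (with weight `gamma`) of its own feature and the average of its neighbors' features; the function names and toy values are illustrative.

```python
def laplacian_smooth(H, adj, gamma=0.5):
    """One smoothing step: mix each node's features with its neighbors' mean."""
    out = []
    for i, h in enumerate(H):
        nbrs = adj[i]  # every node is assumed to have at least one neighbor
        mean_nbr = [sum(H[j][f] for j in nbrs) / len(nbrs) for f in range(len(h))]
        out.append([(1 - gamma) * x + gamma * m for x, m in zip(h, mean_nbr)])
    return out

def spread(H):
    """Largest gap between any two nodes, taken over all feature dimensions."""
    return max(max(col) - min(col) for col in zip(*H))

# Path graph 0 - 1 - 2 - 3 with a single scalar feature per node.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
H = [[0.0], [1.0], [2.0], [3.0]]

H_one = laplacian_smooth(H, adj)
print(H_one)  # features move toward their neighbors

H_many = H
for _ in range(200):
    H_many = laplacian_smooth(H_many, adj)
print(spread(H_many))  # nearly zero: all nodes look alike
```

A single step pulls connected nodes closer together, which helps when neighbors share labels. Repeating the step many times drives all nodes in a connected component to the same value, which is the over-smoothing behavior discussed for deep GCNs.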

最近，吴等人。 [71]从信号处理的角度分析了GCN。 通过将节点特征视为图形信号，他们表明等式（13）基本上是一个固定的低通滤波器。 利用这一见解，他们提出了一种极为简化的图卷积（SGC）架构，方法是消除所有非线性并将学习参数折叠为一个矩阵：

Recently, Wu et al. [71] analyzed GCNs from a signal processing perspective. By regarding node features as graph signals, they showed that Eq. (13) is basically a fixed low-pass filter. Using this insight, they proposed an extremely simplified graph convolution (SGC) architecture by removing all the nonlinearities and collapsing the learning parameters into one matrix:

$$H^L = \left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\right)^L F^V \Theta \tag{42}$$

作者表明，这种“非深度学习” GCN变体在许多任务上都可以与现有GCN媲美。 Maehara [72]通过证明低通滤波操作并未使GCN具有非线性流形学习能力来增强该结果，并进一步提出了GFNN模型来解决此问题，方法是在图卷积层之后添加MLP。

The authors showed that such a “non-deep-learning” GCN variant achieved comparable performance to existing GCNs in many tasks. Maehara [72] enhanced this result by showing that the low-pass filtering operation did not equip GCNs with a nonlinear manifold learning ability, and further proposed the GFNN model to remedy this problem by adding an MLP after the graph convolution layers.

对于专注于图的任务，Kipf和Welling [43]以及SortPooling [49]的作者都考虑了GCN与图内核（例如Weisfeiler-Lehman（WL）内核[78]）之间的关系，后者广泛用于图同构。 测试。 他们表明，从概念上讲，GCN是WL内核的概括，因为这两种方法都会迭代地聚合来自节点邻居的信息。 徐等。 [73]通过证明WL内核在区分图结构方面为GCN提供了上限，从而使这一思想形式化。 基于此分析，他们提出了图同构网络（GIN），并表明使用求和和MLP的读出操作可以实现可证明的最大判别力，即在图分类任务中达到最高的训练精度。

**For graph-focused tasks**, Kipf and Welling [43] and the authors of SortPooling [49] both considered the relationship between GCNs and graph kernels such as the Weisfeiler-Lehman (WL) kernel [78], which is widely used in graph isomorphism tests. They showed that GCNs are conceptually a generalization of the WL kernel because both methods iteratively aggregate information from node neighbors. Xu et al. [73] formalized this idea by proving that the WL kernel provides an upper bound for GCNs in terms of distinguishing graph structures. Based on this analysis, they proposed the graph isomorphism network (GIN) and showed that a readout operation using summation and an MLP can achieve provably maximum discriminative power, i.e., the highest training accuracy in graph classification tasks.

对于一般分析，Scarselli等。 [93]表明，具有不同激活函数的GCN的Vapnik-Chervonenkis维度（VC-dim）具有与现有RNN相同的规模。 Chen等 [65]分析了线性GCN的优化情况，并表明在某些简化下，任何局部最小值都相对接近于全局最小值。 Verma和Zhang [94]分析了GCN的算法稳定性和泛化范围。 他们表明，如果图卷积滤波器的最大绝对特征值与图大小无关，则单层GCN会满足强烈的均匀稳定性概念。

For general analysis, Scarselli et al. [93] showed that the Vapnik-Chervonenkis dimension (VC-dim) of GCNs with different activation functions has the same scale as the existing RNNs. Chen et al. [65] analyzed the optimization landscape of linear GCNs and showed that any local minimum is relatively close to the global minimum under certain simplifications. Verma and Zhang [94] analyzed the algorithmic stability and generalization bound of GCNs. They showed that single-layer GCNs satisfy the strong notion of uniform stability if the largest absolute eigenvalue of the graph convolution filters is independent of the graph size. 

## GAE

自动编码器（AE）及其变体已广泛应用于无监督的学习任务中[95]，适用于学习图的节点表示形式。 隐含的假设是，**图具有固有的，潜在的非线性低秩结构**。 在本节中，我们首先详细说明图自动编码器，然后介绍图变分自动编码器和其他改进。 表5总结了GAE的主要特征。

The autoencoder (AE) and its variations have been widely applied in unsupervised learning tasks [95] and are suitable for learning node representations for graphs. The implicit assumption is that graphs have an inherent, potentially nonlinear low-rank structure. In this section, we first elaborate graph autoencoders and then introduce graph variational autoencoders and other improvements. The main characteristics of GAEs are summarized in Table 5. 

### Autoencoders

对图形AE的使用源自稀疏自动编码器（SAE）[96]。 基本思想是，**通过将邻接矩阵或其变化视为节点的原始特征，可以利用AE作为降维技术来学习低维节点表示**。 具体来说，SAE采用了以下L2重建损失：

The use of AEs for graphs originated from the sparse autoencoder (SAE) [96]. The basic idea is that, by regarding the adjacency matrix or its variations as the raw features of nodes, AEs can be leveraged as a dimensionality reduction technique to learn low-dimensional node representations. Specifically, SAE adopted the following L2-reconstruction loss:

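$$\min_\Theta \mathcal{L}_2 = \sum_{i=1}^{N} \left\| P(i,:) - \hat{P}(i,:) \right\|_2, \qquad \hat{P}(i,:) = \mathcal{G}(h_i), \quad h_i = \mathcal{F}(P(i,:)) \tag{43}$$

where $P$ is the transition matrix, $\hat{P}$ is the reconstructed matrix, $h_i$ is the low-dimensional representation of node $v_i$, $\mathcal{F}(\cdot)$ is the encoder, $\mathcal{G}(\cdot)$ is the decoder, and $\Theta$ are the parameters.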

通过显示等式43中的L2重构损失，结构深层网络嵌入（SDNE）[97]填补了这个难题。 43实际上对应于节点之间的二阶接近度，即，如果两个节点具有相似的邻域，则它们共享相似的latten表示，这是网络科学中经过充分研究的概念，称为协作过滤或三角形闭合[5]。 受网络嵌入方法的启发，表明一阶邻近度也很重要[108]，SDNE通过添加另一个拉普拉斯特征图项来修改目标函数[75]。

Structural deep network embedding (SDNE) [97] filled in the puzzle by showing that the L2-reconstruction loss in Eq. (43) actually corresponds to the second-order proximity between nodes, i.e., two nodes share similar latent representations if they have similar neighborhoods, which is a well-studied concept in network science known as collaborative filtering or triangle closure [5]. Motivated by network embedding methods showing that the first-order proximity is also important [108], SDNE modified the objective function by adding another Laplacian eigenmaps term [75]:

$$\min_\Theta \mathcal{L}_2 + \alpha \sum_{i,j=1}^{N} A(i,j) \left\| h_i - h_j \right\|_2^2 \tag{44}$$

即，如果两个节点直接连接，它们也共享相似的潜在表示。 作者还通过使用邻接矩阵并为零元素和非零元素分配了不同的权重来修改L2重建损失：

i.e., two nodes also share similar latent representations if they are directly connected. The authors also modified the L2-reconstruction loss by using the adjacency matrix and assigning different weights to zero and non-zero elements:

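$$\mathcal{L}_2 = \sum_{i=1}^{N} \left\| \left(A(i,:) - \mathcal{G}(h_i)\right) \odot b_i \right\|_2 \tag{45}$$

where $h_i = \mathcal{F}(A(i,:))$, $b_{ij} = 1$ if $A(i,j) = 0$, and $b_{ij} = \beta > 1$ otherwise, with $\beta$ being another hyper-parameter.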

受另一类研究的启发，当代著作DNGR [98]采用等式20中定义的正点向互信息（PPMI）[79]矩阵取代了等式43中的转换矩阵P。 这样，原始特征可以与图的某些随机游走概率相关联[109]。 但是，构造输入矩阵的时间复杂度为O（N2），无法扩展到大型图。 通过使用Kipf和Welling [43]提出的GCN作为编码器，GC-MC [99]采用了不同的方法：

Motivated by another line of studies, a contemporary work, DNGR [98], replaced the transition matrix $P$ in Eq. (43) with the positive pointwise mutual information (PPMI) [79] matrix defined in Eq. (20). In this way, the raw features can be associated with some random walk probability of the graph [109]. However, constructing the input matrix has a time complexity of $O(N^2)$, which is not scalable to large-scale graphs. GC-MC [99] took a different approach by using the GCN proposed by Kipf and Welling [43] as the encoder:

$$H = \text{GCN}\left(F^V, A\right) \tag{46}$$

并使用简单的双线性函数作为解码器：

and using a simple bilinear function as the decoder:

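$$\hat{A}(i,j) = H(i,:)\, \Theta_{de}\, H(j,:)^T \tag{47}$$

where $\Theta_{de}$ are the decoder parameters.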

使用这种方法，自然可以合并节点特征。 对于没有节点特征的图，使用节点ID的一键编码。 作者在二部图上证明了GC-MC在推荐问题上的有效性

Using this approach, node features were naturally incorporated. For graphs without node features, a one-hot encoding of node IDs was utilized. The authors demonstrated the effectiveness of GC-MC on the recommendation problem on bipartite graphs.

替代重构邻接矩阵或其变体，DRNE [100]提出了另一种修改，该修改通过使用LSTM聚合邻域信息来直接重构低维节点向量。 具体而言，DRNE采用了以下目标函数：

Instead of reconstructing the adjacency matrix or its variations, DRNE [100] proposed another modification that directly reconstructed the low-dimensional node vectors by aggregating neighborhood information using an LSTM. Specifically, DRNE adopted the following objective function: 

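$$\mathcal{L} = \sum_{i=1}^{N} \left\| h_i - \text{LSTM}\left(\left\{h_j \mid j \in \mathcal{N}(i)\right\}\right) \right\| \tag{48}$$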

由于LSTM要求其输入为序列，因此作者建议根据节点邻域的程度对其进行排序。 他们还对有较高的degree的节点采用了邻域采样技术，以防止内存过长。 作者证明，这种方法可以保留规则的等价性以及节点的许多中心性度量，例如PageRank [110]。

Because an LSTM requires its inputs to be a sequence, the authors suggested ordering the node neighborhoods based on their degrees. They also adopted a neighborhood sampling technique for nodes with large degrees to prevent an overlong memory. The authors proved that such a method can preserve regular equivalence as well as many centrality measures of nodes, such as PageRank [110]. 

**与上述将节点映射到低维向量的工作不同**，Graph2Gauss（G2G）[101]建议将每个节点编码为高斯分布hi = N（M（i; :); diag（Σ（i; :)）） 捕获节点的不确定性。 具体来说，作者使用从节点属性到高斯分布的均值和方差的深度映射作为编码器：

Unlike the above works that map nodes into a low-dimensional vector, Graph2Gauss (G2G) [101] proposed encoding each node as a Gaussian distribution $h_i = \mathcal{N}\left(M(i,:), \text{diag}\left(\Sigma(i,:)\right)\right)$ to capture the uncertainties of nodes. Specifically, the authors used a deep mapping from the node attributes to the means and variances of the Gaussian distribution as the encoder:

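$$M(i,:) = \mathcal{F}_M\left(F^V(i,:)\right), \qquad \Sigma(i,:) = \mathcal{F}_\Sigma\left(F^V(i,:)\right) \tag{49}$$

where $\mathcal{F}_M(\cdot)$ and $\mathcal{F}_\Sigma(\cdot)$ are the parametric functions to be learned.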

然后，他们使用成对约束来学习模型，而不是使用显式的解码器函数：

Then, instead of using an explicit decoder function, they used pairwise constraints to learn the model:

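$$\text{KL}\left(h_j \,\|\, h_i\right) < \text{KL}\left(h_{j'} \,\|\, h_i\right) \quad \forall i, \forall j, \forall j' \ \text{s.t.}\ d(i,j) < d(i,j') \tag{50}$$

where $d(i,j)$ is the shortest distance from node $v_i$ to node $v_j$ and $\text{KL}(q(\cdot) \,\|\, p(\cdot))$ is the Kullback-Leibler (KL) divergence between $q(\cdot)$ and $p(\cdot)$.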

换句话说，约束条件确保节点表示之间的KL散度具有与图距离相同的相对顺序。 但是，因为等式（50）难以优化，因此采用了基于能量的损失[112]：

In other words, the constraints ensure that the KL-divergence between node representations has the same relative order as the graph distance. However, because Eq. (50) is hard to optimize, an energy-based loss [112] was adopted as a relaxation:  

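$$\mathcal{L} = \sum_{(i,j,j') \in \mathcal{E}} \left(E_{ij}^2 + \exp\left(-E_{ij'}\right)\right) \tag{51}$$

where $\mathcal{E} = \left\{(i,j,j') \mid d(i,j) < d(i,j')\right\}$ and $E_{ij} = \text{KL}\left(h_j \,\|\, h_i\right)$.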

作者进一步提出了一种无偏见的抽样策略，以加快培训过程。

The authors further proposed an unbiased sampling strategy to accelerate the training process.  
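The ranking idea behind this objective can be sketched with closed-form KL divergences between diagonal Gaussians. The code below is an illustrative reading under stated assumptions, not the G2G implementation: `S` holds per-dimension variances, each triple is (anchor, closer node, farther node), and the loss squares the divergence to the closer node while penalizing a small divergence to the farther node through a negative exponential.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    return 0.5 * sum(
        vq / vp + (mp - mq) ** 2 / vp - 1.0 + math.log(vp / vq)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def energy_loss(M, S, triples):
    """Sum of E_near^2 + exp(-E_far) over (anchor, near, far) triples."""
    total = 0.0
    for i, j_near, j_far in triples:
        e_near = kl_diag_gaussians(M[j_near], S[j_near], M[i], S[i])
        e_far = kl_diag_gaussians(M[j_far], S[j_far], M[i], S[i])
        total += e_near ** 2 + math.exp(-e_far)
    return total

# Hypothetical 1-D embeddings: node 1 coincides with node 0, node 2 is far away.
M = [[0.0], [0.0], [3.0]]
S = [[1.0], [1.0], [1.0]]

good = energy_loss(M, S, [(0, 1, 2)])  # ranking respected: small loss
bad = energy_loss(M, S, [(0, 2, 1)])   # ranking violated: large loss
print(good, bad)

# KL divergence is asymmetric, unlike a distance.
print(kl_diag_gaussians([0.0], [1.0], [0.0], [4.0]),
      kl_diag_gaussians([0.0], [4.0], [0.0], [1.0]))
```

The loss is small when the closer node has a small divergence and the farther node a large one, which is the soft version of the pairwise ordering constraint.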

### Variational Autoencoders

与上述自动编码器不同，变体自动编码器（VAE）是将降维与生成模型结合在一起的另一种深度学习方法。 它的潜在好处包括容忍噪声和学习平滑表示[113]。 VAE最初是在VGAE [102]中引入图形数据的，其中解码器是一个简单的线性乘积：

Different from the aforementioned autoencoders, variational autoencoders (VAEs) are another type of deep learning method that combines dimensionality reduction with generative models. Their potential benefits include tolerating noise and learning smooth representations [113]. VAEs were first introduced to graph data in VGAE [102], where the decoder was a simple linear product:

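$$p(A \mid H) = \prod_{i,j=1}^{N} \sigma\left(h_i h_j^T\right) \tag{52}$$

in which the node representations were assumed to follow a Gaussian distribution $q(h_i \mid M, \Sigma) = \mathcal{N}\left(h_i \mid M(i,:), \text{diag}\left(\Sigma(i,:)\right)\right)$. For the encoder of the mean and variance matrices, the authors adopted the GCN proposed by Kipf and Welling [43]:

$$M = \text{GCN}_M\left(F^V, A\right), \qquad \log \Sigma = \text{GCN}_\Sigma\left(F^V, A\right) \tag{53}$$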

之后，通过最小化变化的下界来学习模型参数[113]：

Then, the model parameters were learned by maximizing the variational lower bound [113]:

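$$\mathcal{L} = E_{q\left(H \mid F^V, A\right)}\left[\log p(A \mid H)\right] - \text{KL}\left(q\left(H \mid F^V, A\right) \,\|\, p(H)\right) \tag{54}$$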

但是，由于此方法需要重建完整图，因此其时间复杂度为O（N2）。

However, because this approach requires reconstructing the full graph, its time complexity is $O(N^2)$.
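The three ingredients of such a graph VAE (sampling node representations from per-node Gaussians, decoding edge probabilities from inner products, and regularizing toward a prior) can be sketched as follows. This is a simplified illustration with stated assumptions: the means `M` and log standard deviations `log_sigma` are given rather than produced by a GCN encoder, the prior is a standard normal, and all function names are hypothetical.

```python
import math
import random

def reparameterize(M, log_sigma, rng):
    """Sample h_i = mu_i + sigma_i * eps with eps ~ N(0, I)."""
    return [[m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, lsig)]
            for mu, lsig in zip(M, log_sigma)]

def decode(H):
    """Edge probabilities sigmoid(h_i . h_j) for every node pair (N x N entries)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [[sigmoid(sum(a * b for a, b in zip(hi, hj))) for hj in H] for hi in H]

def kl_to_standard_normal(M, log_sigma):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over nodes and dimensions."""
    return 0.5 * sum(m * m + math.exp(2 * ls) - 1.0 - 2 * ls
                     for mu, lsig in zip(M, log_sigma) for m, ls in zip(mu, lsig))

# Hypothetical encoder outputs for 3 nodes with 2 latent dimensions.
M = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
log_sigma = [[-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0]]

H = reparameterize(M, log_sigma, random.Random(0))
P = decode(H)
print(P)
print(kl_to_standard_normal(M, log_sigma))
```

The decoder fills an N-by-N matrix of edge probabilities, which makes the quadratic cost noted above explicit. Sampling through a deterministic function of the mean, the standard deviation, and external noise is what allows gradients to reach the encoder parameters.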

受SDNE和G2G的推动，DVNE [103]为图数据提出了另一个VAE，该图也将每个节点都表示为高斯分布。 与现有的采用KLdivergence作为度量的工作不同，DVNE使用Wasserstein距离[114]来保留节点相似性的传递性。 与SDNE和G2G相似，DVNE在其目标函数中还保留了一阶和二阶接近度：

Motivated by SDNE and G2G, DVNE [103] proposed another VAE for graph data that also represented each node as a Gaussian distribution. Unlike the existing works that had adopted the KL divergence as the measurement, DVNE used the Wasserstein distance [114] to preserve the transitivity of the node similarities. Similar to SDNE and G2G, DVNE also preserved both the first- and second-order proximity in its objective function:

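$$\min_\Theta \sum_{(i,j,j') \in \mathcal{E}} \left(E_{ij}^2 + \exp\left(-E_{ij'}\right)\right) + \alpha \mathcal{L}_2 \tag{55}$$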

一组对应于一阶接近度的排名损失的三元组。 重建损失定义如下：

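where $E_{ij} = W_2\left(h_j \,\|\, h_i\right)$ is the 2-Wasserstein distance between the two Gaussian distributions $h_j$ and $h_i$, and $\mathcal{E} = \left\{(i,j,j') \mid j \in \mathcal{N}(i), j' \notin \mathcal{N}(i)\right\}$ is a set of triples corresponding to the ranking loss of the first-order proximity. The reconstruction loss was defined as follows: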

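$$\mathcal{L}_2 = \inf_{q(Z \mid P)} E_{p(P)} E_{q(Z \mid P)} \left\| P \odot \left(P - \mathcal{G}(Z)\right) \right\|_2^2 \tag{56}$$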

其中P是过渡矩阵，Z表示从H提取的样本。该框架如图9所示。使用这种方法，可以像使用常规参数重设技巧一样，在常规VAE中最小化目标函数[113]。

where P is the transition matrix and Z represents samples drawn from H. The framework is shown in Figure 9. Using this approach, the objective function can be minimized as in conventional VAEs using the reparameterization trick [113].  
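For Gaussians with diagonal covariances, the 2-Wasserstein distance has a simple closed form: the squared distance is the squared Euclidean distance between the means plus the squared Euclidean distance between the standard deviations. The sketch below uses that formula to check the metric properties that motivate its use; the function name and sample values are illustrative.

```python
import math

def w2_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, diag(sigma1^2)) and N(mu2, diag(sigma2^2)).

    sigma1 and sigma2 are per-dimension standard deviations.
    """
    squared = (sum((a - b) ** 2 for a, b in zip(mu1, mu2))
               + sum((a - b) ** 2 for a, b in zip(sigma1, sigma2)))
    return math.sqrt(squared)

# Three hypothetical node embeddings (mean, standard deviation).
g1 = ([0.0, 0.0], [1.0, 1.0])
g2 = ([1.0, 2.0], [0.5, 1.5])
g3 = ([3.0, 1.0], [2.0, 1.0])

d12 = w2_diag_gaussians(*g1, *g2)
d23 = w2_diag_gaussians(*g2, *g3)
d13 = w2_diag_gaussians(*g1, *g3)
print(d12, d23, d13)
print(d13 <= d12 + d23)  # triangle inequality
```

Unlike the KL divergence, this quantity is symmetric and satisfies the triangle inequality, so if a node is close to two others, those two cannot be arbitrarily far apart. That is the transitivity property referred to above.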

### Improvement and Discussion

#### Adversarial Training

对抗训练方案9被纳入GAE，作为ARGA中的附加正则化术语[104]。 总体架构如图10所示。具体地说，GAE的编码器用作生成器，而鉴别器的目的是区分潜在表示是来自生成器还是来自先验分布。 以这种方式，自动编码器被迫匹配先前的分布作为正则化。 目标函数是：

An adversarial training scheme was incorporated into GAEs as an additional regularization term in ARGA [104]. The overall architecture is shown in Figure 10. Specifically, the encoder of GAEs was used as the generator while the discriminator aimed to distinguish whether a latent representation came from the generator or from a prior distribution. In this way, the autoencoder was forced to match the prior distribution as a regularization. The objective function was:

$$\min_{\mathcal{G}} \max_{\mathcal{D}} E_{h \sim p_h}\left[\log \mathcal{D}(h)\right] + E_{z \sim \mathcal{G}\left(F^V, A\right)}\left[\log\left(1 - \mathcal{D}(z)\right)\right] \tag{57}$$

其中G FV; A是使用Eq中的图卷积编码器的生成器。 （53），D（·）是基于交叉熵损失的判别器，ph是先验分布。 该研究采用了简单的高斯先验，实验结果证明了对抗训练方案的有效性。

where $\mathcal{G}\left(F^V, A\right)$ is a generator that uses the graph convolutional encoder from Eq. (53), $\mathcal{D}(\cdot)$ is a discriminator based on the cross-entropy loss, and $p_h$ is the prior distribution. The study adopted a simple Gaussian prior, and the experimental results demonstrated the effectiveness of the adversarial training scheme.

同时，NetRA [105]还提出使用生成对抗网络（GAN）[115]来增强图自动编码器的泛化能力。 具体来说，作者使用了以下目标函数：

Concurrently, NetRA [105] also proposed using a generative adversarial network (GAN) [115] to enhance the generalization ability of graph autoencoders. Specifically, the authors used the following objective function:

$$\min_\Theta \mathcal{L}_2 + \alpha_1 \mathcal{L}_{LE} + \alpha_2 \mathcal{L}_{GAN} \tag{58}$$

其中，LLE是等式中所示的拉普拉斯特征图目标函数。 （44）。 此外，作者采用LSTM作为编码器，以汇总来自类似于Eq的邻域的信息。 （48）。 与其像DRNE [100]中那样仅对直接邻居进行采样并使用度对节点进行排序，作者还使用了随机游走来生成输入序列。 与ARGA相比，NetRA认为GAE中的表示形式是真实的，并采用了随机的高斯噪声，然后是MLP作为生成器。

where $\mathcal{L}_{LE}$ is the Laplacian eigenmaps objective function shown in Eq. (44). In addition, the authors adopted an LSTM as the encoder to aggregate information from neighborhoods, similar to Eq. (48). Instead of sampling only immediate neighbors and ordering the nodes using degrees as in DRNE [100], the authors used random walks to generate the input sequences. In contrast to ARGA, NetRA considered the representations in GAEs as the ground truth and adopted random Gaussian noise followed by an MLP as the generator.

#### Inductive Learning  

与GCN相似，如果将节点属性合并到编码器中，则GAE可以应用于归纳学习设置。 这可以通过使用GCN作为编码器来实现，例如在GC-MC [99]，VGAE [102]和ARGA [104]中，或者通过像G2G [101]中那样直接从节点特征中学习映射函数来实现。 因为仅在学习参数时才使用边缘信息，所以该模型也可以应用于训练期间看不到的节点。 这些工作还表明，尽管GCN和GAE基于不同的体系结构，但可以将它们组合使用，我们认为这是一个有希望的未来方向。

Similar to GCNs, GAEs can be applied to the inductive learning setting if node attributes are incorporated in the encoder. This can be achieved by using a GCN as the encoder, such as in GC-MC [99], VGAE [102], and ARGA [104], or by directly learning a mapping function from node features as in G2G [101]. Because the edge information is utilized only when learning the parameters, the model can also be applied to nodes unseen during training. These works also show that although GCNs and GAEs are based on different architectures, it is possible to use them jointly, which we believe is a promising future direction.
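The inductive setting can be sketched as follows: once the encoder is a function of node features, a node that was absent during training can be embedded with the same parameters. The mean-aggregation layer, the `encode` function, and the weight values are hypothetical stand-ins, not the encoders of the cited models.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def encode(node, features, neighbors, weight):
    """Embed `node` from the mean of its own and its neighbors' features, then ReLU."""
    group = [node] + list(neighbors.get(node, []))
    dim = len(features[node])
    mean = [sum(features[n][k] for n in group) / len(group) for k in range(dim)]
    return [max(0.0, x) for x in matvec(weight, mean)]


# Hypothetical parameters standing in for weights learned on the training graph.
weight = [[0.5, -0.2, 0.1],
          [0.3, 0.8, -0.5]]

features = {"a": [1.0, 0.0, 2.0], "b": [0.0, 1.0, 1.0]}
neighbors = {"a": ["b"], "b": ["a"]}

# A node that was not present during training arrives with its features and edges.
features["new"] = [2.0, 1.0, 0.0]
neighbors["new"] = ["a"]

embedding = encode("new", features, neighbors, weight)
print(embedding)
```

If a node has no entry in `neighbors`, the function reduces to a pure mapping from node features, which is closer in spirit to the feature-only encoder described for G2G.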

#### Similarity Measures  

在GAE中，已采用了许多相似性度量，例如L2重建损失，Laplacian特征图和图AE的秩损失，以及图VAE的KL散度和Wasserstein距离。 尽管这些相似性度量基于不同的动机，但是如何为给定的任务和模型架构选择合适的相似性度量仍未研究。 需要更多的研究来了解这些指标之间的根本差异。

In GAEs, many similarity measures have been adopted: for example, the L2 reconstruction loss, Laplacian eigenmaps, and the ranking loss for graph AEs, and the KL divergence and Wasserstein distance for graph VAEs. Although these similarity measures are based on different motivations, how to choose an appropriate similarity measure for a given task and model architecture remains unstudied. More research is needed to understand the underlying differences between these measures.
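To make the differences concrete, the sketch below writes each measure as a small function. These are common textbook forms under simplifying assumptions (diagonal Gaussians, a hinge-style ranking loss chosen for illustration) and need not match the exact definitions used in the cited models.

```python
import math


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def l2_reconstruction_loss(x, x_hat):
    """Squared L2 error between an input vector and its reconstruction."""
    return squared_distance(x, x_hat)


def laplacian_eigenmaps_loss(adj, h):
    """Sum over i, j of A[i][j] * ||h_i - h_j||^2: linked nodes should embed closely."""
    n = len(adj)
    return sum(adj[i][j] * squared_distance(h[i], h[j])
               for i in range(n) for j in range(n))


def ranking_loss(h_anchor, h_pos, h_neg, margin=1.0):
    """Hinge-style ranking (assumption): the positive should be closer than the negative."""
    return max(0.0, margin + squared_distance(h_anchor, h_pos)
               - squared_distance(h_anchor, h_neg))


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def wasserstein2_squared(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    return squared_distance(mu1, mu2) + squared_distance(sigma1, sigma2)


adj = [[0, 1], [1, 0]]
h = [[0.0, 0.0], [1.0, 1.0]]
print(laplacian_eigenmaps_loss(adj, h))
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
print(wasserstein2_squared([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

The first three functions compare point embeddings, whereas the last two compare distributions, which is one reason the measures are not directly interchangeable across graph AEs and graph VAEs.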

## Reinforcement Learning

深度学习的一个方面尚未讨论，即强化学习（RL），它已被证明在AI任务（例如玩游戏）中很有效[122]。 众所周知，RL善于从反馈中学习，尤其是在处理不可微分的目标和约束时。 在本节中，我们回顾了Graph RL方法。 表6总结了它们的主要特性。

One aspect of deep learning not yet discussed is reinforcement learning (RL), which has been shown to be effective in AI tasks such as playing games [122]. RL is known to be good at learning from feedback, especially when dealing with non-differentiable objectives and constraints. In this section, we review Graph RL methods. Their main characteristics are summarized in Table 6.

GCPN [116]利用RL生成目标导向的分子图，同时考虑了非微分的目标和约束。 具体而言，将图生成建模为添加节点和边的马尔可夫决策过程，并将生成模型视为在图生成环境中运行的RL代理。 通过将代理动作作为链接预测，使用特定于域的以及对抗性的奖励，并使用GCN来学习节点表示，可以使用策略梯度以端到端的方式训练GCPN [123]。

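GCPN [116] utilized RL to generate goal-directed molecular graphs while considering non-differentiable objectives and constraints. Specifically, the graph generation is modeled as a Markov decision process of adding nodes and edges, and the generative model is regarded as an RL agent operating in the graph generation environment. By treating agent actions as link predictions, using domain-specific as well as adversarial rewards, and using GCNs to learn the node representations, GCPN can be trained in an end-to-end manner using a policy gradient [123].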
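The Markov decision process view of graph generation can be sketched with a toy environment in which every action adds an edge, either between existing nodes or to a new node. The class `GraphGenerationEnv`, the degree cap standing in for a chemical valency constraint, the constant reward, and the random policy are all illustrative assumptions; GCPN itself uses a learned policy and richer rewards.

```python
import random


class GraphGenerationEnv:
    """Toy graph-generation MDP: each action adds one edge, possibly to a new node."""

    def __init__(self, max_degree=4, max_nodes=6):
        self.max_degree = max_degree  # hard constraint, loosely analogous to valency
        self.max_nodes = max_nodes
        self.reset()

    def reset(self):
        self.edges = set()
        self.degree = {0: 0}  # generation starts from a single node
        return self.state()

    def state(self):
        return len(self.degree), frozenset(self.edges)

    def valid_actions(self):
        """Link-style actions (u, v); v equal to the node count means 'add a new node'."""
        n = len(self.degree)
        actions = []
        for u in range(n):
            if self.degree[u] >= self.max_degree:
                continue
            for v in range(u + 1, n):
                if self.degree[v] < self.max_degree and (u, v) not in self.edges:
                    actions.append((u, v))
            if n < self.max_nodes:
                actions.append((u, n))
        return actions

    def step(self, action):
        """Apply a valid action and return (state, reward, done)."""
        u, v = action
        self.degree.setdefault(v, 0)
        self.edges.add((u, v))
        self.degree[u] += 1
        self.degree[v] += 1
        reward = 1.0  # hypothetical placeholder for a domain-specific reward
        done = not self.valid_actions()
        return self.state(), reward, done


def rollout(env, rng, max_steps=20):
    """Run one episode with a uniformly random policy and return the total reward."""
    env.reset()
    total = 0.0
    for _ in range(max_steps):
        actions = env.valid_actions()
        if not actions:
            break
        _, reward, done = env.step(rng.choice(actions))
        total += reward
        if done:
            break
    return total


env = GraphGenerationEnv(max_degree=3, max_nodes=5)
total_reward = rollout(env, random.Random(0))
print(total_reward, sorted(env.edges))
```

Restricting the agent to `valid_actions` is one simple way to respect a non-differentiable constraint; a policy-gradient method would replace the random choice with a parameterized distribution over these actions and update it from the episode rewards.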

## Discussion and Conclusion

到目前为止，我们已经回顾了不同的基于图的深度学习架构以及它们的异同。 接下来，在总结本文之前，我们简要讨论它们的应用，实现和未来的方向。

Thus far, we have reviewed the different graph-based deep learning architectures as well as their similarities and differences. Next, we briefly discuss their applications, implementations, and future directions before summarizing this paper.  

### Application

除了诸如节点或图形分类之类的标准图形推理任务外，基于图形的深度学习方法也已应用于多种学科，包括建模社会影响力[133]，建议[28]，[66]，[99]，[134]，化学和生物学[46]，[52]，[55]，[116]，[117]，物理学[135]，[136]，疾病和药物预测[137]-[139]， 基因表达[140]，自然语言处理（NLP）[141]，[142]，计算机视觉[143]-[147]，流量预测[148]，[149]，程序归纳[150]，基于图的求解 NP问题[151]，[152]和多主体AI系统[153]-[155]。

In addition to standard graph inference tasks such as node or graph classification, graph-based deep learning methods have also been applied to a wide range of disciplines, including modeling social influence [133], recommendation [28], [66], [99], [134], chemistry and biology [46], [52], [55], [116], [117], physics [135], [136], disease and drug prediction [137]–[139], gene expression [140], natural language processing (NLP) [141], [142], computer vision [143]–[147], traffic forecasting [148], [149], program induction [150], solving graph-based NP problems [151], [152], and multi-agent AI systems [153]–[155].

由于这些应用程序的多样性，因此对这些方法进行彻底的审查超出了本文的范围。但是，我们列出了一些关键灵感。首先，在构造图形或选择架构时，将领域知识纳入模型很重要。例如，基于相对距离构建图形可能适用于交通预测问题，但可能不适用于地理位置也很重要的天气预报问题。其次，基于图的模型通常可以在其他体系结构之上构建，而不是作为独立模型构建。例如，计算机视觉社区通常采用CNN来检测对象，然后将基于图的深度学习用作推理模块[156]。对于NLP问题，可以将GCN用作语法约束[141]。 结果，关键的关键挑战是如何集成不同的模型。 这些应用程序还表明，基于图的深度学习不仅能够挖掘现有图数据背后的丰富价值，而且还有助于将关系数据自然地建模为图，从而极大地扩展了基于图的深度学习模型的适用性。

A thorough review of these methods is beyond the scope of this paper due to the sheer diversity of these applications; however, we list several key inspirations. First, it is important to incorporate domain knowledge into the model when constructing a graph or choosing architectures. For example, building a graph based on the relative distance may be suitable for traffic forecasting problems, but may not work well for a weather prediction problem where the geographical location is also important. Second, a graph-based model can usually be built on top of other architectures rather than as a stand-alone model. For example, the computer vision community usually adopts CNNs for detecting objects and then uses graph-based deep learning as a reasoning module [156]. For NLP problems, GCNs can be adopted as syntactic constraints [141]. As a result, a key challenge is how to integrate different models. These applications also show that graph-based deep learning not only enables mining the rich value underlying the existing graph data but also helps to naturally model relational data as graphs, greatly widening the applicability of graph-based deep learning models.
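As an example of encoding domain knowledge at graph-construction time, the sketch below builds a weighted adjacency matrix from pairwise sensor distances with a thresholded Gaussian kernel. The kernel, the parameters `sigma` and `threshold`, and the coordinates are assumptions chosen for illustration rather than a prescription from the cited works.

```python
import math


def distance_graph(coords, sigma, threshold):
    """Weighted adjacency from pairwise distances via a thresholded Gaussian kernel."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(coords[i], coords[j])
            weight = math.exp(-(dist ** 2) / (sigma ** 2))
            if weight >= threshold:  # drop weak links to keep the graph sparse
                adj[i][j] = adj[j][i] = weight
    return adj


# Hypothetical sensor positions: three nearby sensors and one far away.
sensors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
adj = distance_graph(sensors, sigma=2.0, threshold=0.1)
for row in adj:
    print([round(w, 3) for w in row])
```

Because the weights depend only on relative distance, translating all sensors by the same offset yields the same graph. That invariance is reasonable for traffic sensors but discards the absolute location that matters in a task such as weather prediction.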

### Implementation

最近，已经提供了几个开放库，用于在图上开发深度学习模型。 这些库在表8中列出。我们还收集了一份源代码列表（大部分来自其原始作者）以供本文中讨论的研究之用。 该存储库包含在附录A中。这些开放的实现使您可以轻松学习，比较和改进不同的方法。 一些实现也解决了分布式计算的问题，我们不在本文中讨论。

Recently, several open libraries have been made available for developing deep learning models on graphs. These libraries are listed in Table 8. We also collected a list of source code (mostly from their original authors) for the studies discussed in this paper. This repository is included in Appendix A. These open implementations make it easy to learn, compare, and improve different methods. Some implementations also address the problem of distributed computing, which we do not discuss in this paper.

